Model For Building & Packaging Secure Clients

With respect to running a MapReduce job against data contained in a secure store, a particularly important issue to address is the communication of user credentials to the tasks run on each of the DataNodes on which the Hadoop infrastructure executes the job. Recall from above that, when using the MapReduce programming model defined by Apache Hadoop, the tasks executed by a MapReduce job each act as a client of the store. Thus, if the store is configured for secure access, then in order to retrieve the desired data from the store, each task must have access to the credentials of the user associated with that data. The typical mechanism for providing the necessary credentials to a client of a secure store is to manually install the credentials on the client's local file system; for example, by employing a utility such as scp.

Although the manual mechanism is practical for most clients of a secure store, it is extremely impractical for a MapReduce job. This is because a MapReduce job consists of multiple tasks running in parallel, in separate address spaces, each with a separate file system that is generally not under the control of the user. Assuming, then, that write access is granted by the Hadoop administrator (a problem in and of itself), the client credentials for every possible user known to the given secure store would need to be manually installed on the file system of each of the many nodes in the Hadoop cluster; something that may be very difficult to achieve.

To address this issue, a model will be presented that developers and deployers can employ to facilitate the communication of each user's credentials to a given MapReduce job from the client side of the job; that is, from the address space controlled by the job's client process, owned by the user.

This model will consist of two primary components: a programming model for executing MapReduce jobs that retrieve and process data contained in tables located in a secure store; and a set of "best practices" for building, packaging, and deploying those jobs. Although there is nothing preventing a user from manually installing the necessary security credentials on all nodes in a given cluster, doing so is not only impractical, but may result in various security vulnerabilities. Combining this programming model with the deployment best practices that are presented here should help developers and deployers not only avoid the need to manually pre-install credentials on the DataNodes of the Hadoop cluster, but should also prevent the sort of security vulnerabilities that can occur with manual installation.

Programming Model For MapReduce with Oracle NoSQL Database Security

Recall that when executing a MapReduce job, the client application uses mechanisms provided by the Hadoop infrastructure to initiate the job from a node (referred to as the Hadoop cluster's access node) that has network access to the node running the Hadoop cluster's ResourceManager. If the job will be run against a secure store, then prior to initiating the job, the client must initialize the job's TableInputFormat with the following three pieces of information:

  • The name of the file that specifies the transport properties the client will use when connecting to the store; which, for the purposes of this document, will be referred to as the login properties file (or login file).
  • The PasswordCredentials containing the username and password the client will present to the store during authentication.
  • The name of the file containing the public keys and/or certificates needed for authentication; which, for the purposes of this document, will be referred to as the client trust file (or trust file).

To perform this initialization, the MapReduce client application (CountTableRows in this case) invokes the setKVSecurity method defined in TableInputFormat. Once this initialization has been performed and the job has been initiated, the job uses that TableInputFormat to create and assign a TableInputSplit (a split) to each of the Mapper tasks that will run on the DataNodes in the cluster. The TableInputFormat needs the information initialized by the setKVSecurity method for two reasons (a sketch of this client side initialization follows the list below):

  • To connect to the secure store from the access node and retrieve the information needed to create the splits.
  • To initialize each split with that same security information, so that each such split can connect to the secure store from its DataNode host and retrieve the particular table data the split will process.
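
The following is a minimal sketch of that client side initialization. It assumes the static configuration methods of TableInputFormat shown here (including setKVSecurity, with the three arguments described above) have the signatures used below; the store name, helper host, table name, and artifact file names are the example values used elsewhere in this document, and the SecureJobSetup class itself is purely illustrative rather than part of the CountTableRows example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import oracle.kv.PasswordCredentials;
import oracle.kv.hadoop.table.TableInputFormat;

// Illustrative helper class; adapt the values to your own deployment.
public class SecureJobSetup {

    public static Job createJob(Configuration conf,
                                PasswordCredentials userCredentials)
            throws Exception {

        // Identify the store and table the job will read
        // (example values from this document).
        TableInputFormat.setKVStoreName("example-store");
        TableInputFormat.setKVHelperHosts(new String[] { "kv-host-1:5000" });
        TableInputFormat.setTableName("vehicleTable");

        // Supply the security information: the client side login file,
        // the user's PasswordCredentials, and the client trust file.
        TableInputFormat.setKVSecurity(
            "example-user-client-pwdfile.login", // login properties file
            userCredentials,                     // username and password
            "client.trust");                     // public trust file

        // Standard Hadoop job wiring.
        Job job = Job.getInstance(conf, "CountTableRows");
        job.setJarByClass(SecureJobSetup.class);
        job.setInputFormatClass(TableInputFormat.class);
        return job;
    }
}

The user's PasswordCredentials passed to this sketch would typically be constructed as described later in this document, from a password file or Oracle Wallet installed on the access node.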

In addition to requiring that the MapReduce application use the mechanism just described to initialize and configure the job's TableInputFormat (and thus, its splits) with the information listed above, the model also requires that the public and private security credentials referenced by that information be communicated to the TableInputFormat, as well as the splits, securely. How this is achieved depends on whether that information is being communicated to the TableInputFormat on the client side of the application, or to the splits on the server side.

Communicating Security Credentials to the Server Side Splits

To facilitate communication of the user's security credentials to the splits distributed to each of the DataNodes of the cluster, the model presented here separates public security information from the private information (the username and password), and then stores the private information as part of each split's internal state, rather than on the local file system of each associated DataNode; which may be vulnerable or difficult/impossible to secure. For communication of the public contents of the login and trust files to each such split, the model supports an (optional) mechanism that allows the application to communicate that information as Java resources that each split retrieves from the classpath of the split's Java VM. This avoids the need to manually transfer the contents of those files to each DataNode's local file system, and also avoids the potential security vulnerabilities that can result from manual installation on those nodes. Note that when an application wishes to employ this mechanism, it will typically include the necessary information in a JAR file that is specified to the MapReduce job via the Hadoop command line directive -libjars.

The intent of the mechanism just described is to allow applications to exploit the Hadoop infrastructure to automatically distribute the public login and trust information to each split belonging to the job via a JAR file added to the classpath on each remote DataNode. But it is important to note that although this mechanism is used to distribute the application's public credentials, it must not be used to distribute any of the private information related to authentication; specifically, the username and password. This is important because a JAR file that is distributed to the DataNodes in the manner described may be cached on the associated DataNode's local file system; which might expose a vulnerability. As a result, private authentication information is only communicated as part of each split's internal state.

The separation of public and private credentials supported by this model not only prevents caching the private credentials on each DataNode, but also facilitates the ability to guarantee the confidentiality of that information, via whatever external third party secure communication mechanism the current Hadoop implementation happens to employ. This capability is also important to support the execution of Hive queries against a secure store.
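
To make the resource based mechanism concrete, the short sketch below shows how code running in a split's Java VM could retrieve the public login artifact from the task's classpath once the JAR specified via -libjars has been distributed to the DataNode. The SplitResourceLoader class is purely illustrative (it is not part of the Oracle NoSQL Database API), and the resource name is the server side login file used later in this document.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Illustrative helper only; not part of the Oracle NoSQL Database API.
public final class SplitResourceLoader {

    // Load a *.login properties resource from the task's classpath
    // (for example, one distributed to each DataNode via -libjars).
    public static Properties loadLoginProperties(String resourceName)
            throws IOException {

        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException(resourceName + " not found on classpath");
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    public static void main(String[] args) throws IOException {
        // Public, server side artifacts only; the private credentials are
        // carried in each split's internal state, never as resources.
        Properties login = loadLoginProperties("example-user-server.login");
        System.out.println("Loaded " + login.size() + " login properties");
    }
}

Because only public artifacts are ever packaged this way, caching of the JAR file on a DataNode's local file system exposes no private information.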

Communicating Security Credentials to the TableInputFormat

With respect to the job's TableInputFormat, the programming model supports different options for communicating the user's security information. This is because the TableInputFormat operates only on the access node, on the client side of the job; which means that there is only one file system that needs to be secured. Additionally, unlike the splits, the TableInputFormat is not sent on the wire. Thus, as long as only the user is granted read privileges, both the public and private security information can be installed on the access node's file system without fear of compromise. For this case, the application would typically use system properties on the command line to specify the fully-qualified paths to the login, trust, and password files (or Oracle Wallet); which the TableInputFormat would then read from the local file system, retrieving the necessary public and private security information.
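
As a sketch of this first option, the client might read the fully-qualified artifact paths from Java system properties supplied on the command line; the property names used below are hypothetical, chosen only for illustration, and are not defined by Oracle NoSQL Database.

// Illustrative only; kv.login.file, kv.trust.file, and kv.password.file
// are hypothetical property names an application could define for itself.
public final class SecurityPathsFromProperties {

    public static void main(String[] args) {
        // Example invocation (paths are illustrative):
        //   java -Dkv.login.file=/etc/kv/example-user-client-pwdfile.login \
        //        -Dkv.trust.file=/etc/kv/client.trust \
        //        -Dkv.password.file=/etc/kv/example-user.passwd ...
        String loginFile    = System.getProperty("kv.login.file");
        String trustFile    = System.getProperty("kv.trust.file");
        String passwordFile = System.getProperty("kv.password.file");

        // The paths read here would be used to load the private credentials
        // from the access node's local file system and to initialize the
        // job's TableInputFormat (for example, via setKVSecurity).
        System.out.println("login=" + loginFile
                + " trust=" + trustFile
                + " passwd=" + passwordFile);
    }
}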

A second option for communicating the user's security credentials to the TableInputFormat is to include the public and private information as resources in the client side classpath of the Java VM in which the TableInputFormat runs. This is the option employed by the example presented in this document, and is similar to what was described above for the splits. This option demonstrates how an application's build model can be exploited to simplify not only the application's command line, but also the deployment of secure MapReduce jobs in general. As was the case with the splits, applications will typically communicate the necessary security information as Java resources by including that information in a JAR file. But rather than using the Hadoop command line directive -libjars to specify the JAR file to the server side of the MapReduce job, in this case, because the TableInputFormat operates on only the client side access node, the JAR file would simply be added to the HADOOP_CLASSPATH environment variable.

Best Practices: MapReduce Application Packaging for Oracle NoSQL Security

To help users achieve the sort of separation of public and private security information described in previous sections, a set of (optional) best practices related to packaging the client application and its necessary artifacts is presented in this section and is employed by the example featured in this document. Although the use of these packaging practices is optional, you are encouraged to employ them when working with any MapReduce jobs of your own that will interact with a secure store.

Rather than manually installing the necessary security artifacts (login file, trust file, password file or Oracle Wallet) on each DataNode in the cluster, users should instead install those artifacts only on the cluster's single access node; the node from which the client application is executed. The client application can then retrieve each artifact from the local environment, repackage the necessary information, and then employ mechanisms provided by the Hadoop infrastructure to transfer that information to the appropriate components of the MapReduce job that will be executed.

For example, as described in the previous section, your client application can be designed to retrieve the username and the location of the password from the command line, a configuration file, or a resource in the client classpath; where the location of the user's password is a locally installed password file or Oracle Wallet that can only be read by the user. After retrieving the username from the command line and the password from the specified location, the client uses that information to create the user's PasswordCredentials, which are transferred to each MapReduce task via the splits that are created by the job's TableInputFormat. Using this model, the user's PasswordCredentials are never written to the file systems of the cluster's DataNodes. They are only held in each task's memory. As a result, the integrity and confidentiality of those credentials need to be protected only while they are on the wire, which can be achieved by using whatever external third party secure communication mechanism the current Hadoop implementation happens to employ.
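
A minimal sketch of that flow is shown below. It assumes a clear text password file whose first line contains the password and which is readable only by the user; the CredentialFactory helper and that file layout are illustrative only (a password store created by the Oracle NoSQL Database security tooling has its own format and accessors), while PasswordCredentials is the class whose instances the job's splits ultimately carry in memory.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import oracle.kv.PasswordCredentials;

// Illustrative helper; assumes a clear text password file whose first line
// is the password and which only the user can read.
public final class CredentialFactory {

    public static PasswordCredentials fromPasswordFile(String username,
                                                       String passwordFile)
            throws IOException {

        String password =
            Files.readAllLines(Paths.get(passwordFile),
                               StandardCharsets.UTF_8).get(0).trim();

        // The resulting credentials are held only in memory and handed to
        // the job's TableInputFormat (via setKVSecurity); they are never
        // written to the DataNodes' file systems.
        return new PasswordCredentials(username, password.toCharArray());
    }
}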

With respect to the transfer of the public login and trust artifacts, the client application can exploit the mechanisms provided by the Hadoop infrastructure to automatically transfer classpath (JAR) artifacts to the job's tasks. As demonstrated by the CountTableRows example presented in the body of this document, the client application's build process can be designed to separate the application's class files from its public security artifacts. Specifically, the application's class files and optionally, the public and private credentials, can be placed in a local JAR file on the access node for inclusion in the classpath of the client itself; while only the public login properties and client trust information are placed in a separate JAR file that can be added to the hadoop command line specification of -libjars for inclusion in the classpath of each MapReduce task.

Application Packaging for the Non-Secure Case

To understand how the packaging model discussed here can be employed when executing an application against a secure store, it may be helpful to first review how the CountTableRows example is executed against a non-secure store. Recall from the previous sections that, for the non-secure case, the following command was executed to produce a JAR file containing only the class files needed by CountTableRows.

cd /opt/oracle/nosql/apps/kv/examples
jar cvf CountTableRows.jar hadoop/table/CountTableRows*.class

which produced the file CountTableRows.jar, whose contents look like:

META-INF/
META-INF/MANIFEST.MF
hadoop/table/CountTableRows.class
hadoop/table/CountTableRows$Map.class
hadoop/table/CountTableRows$Reduce.class

and the following commands were then used to execute the CountTableRows example MapReduce job against a non-secure store:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:\
    /opt/ondb/kv/lib/kvclient.jar

cd /opt/ondb/kv 
hadoop jar examples/non_secure_CountTableRows.jar \
    hadoop.table.CountTableRows \
    -libjars \
    /opt/oracle/kv-ee/lib/kvclient.jar,\
    /opt/oracle/kv-ee/lib/sklogger.jar,\
    /opt/oracle/kv-ee/lib/commonutil.jar,\
    /opt/oracle/kv-ee/lib/failureaccess.jar,\
    /opt/oracle/kv-ee/lib/antlr4-runtime-nosql-shaded.jar,\
    /opt/oracle/kv-ee/lib/jackson-core.jar,\
    /opt/oracle/kv-ee/lib/jackson-databind.jar,\
    /opt/oracle/kv-ee/lib/jackson-annotations.jar \
    example-store \
    kv-host-1:5000 \
    vehicleTable \
    /user/example-user/CountTableRows/vehicleTable/0001

Observe that there are three classpaths that must be set when a MapReduce job is executed. First, the jar specification to the Hadoop command interpreter makes the class files of the main program (CountTableRows in this case) accessible to the hadoop launcher mechanism, so that the program can be loaded and executed. Next, the HADOOP_CLASSPATH environment variable must be set to include any third party libraries that the program or the Hadoop framework, running on the local access node, may need to load. For the example above, only kvclient.jar is added to HADOOP_CLASSPATH, so that the Hadoop framework's job initiation mechanism on the access node can access TableInputFormat and its related classes. Compare this with the specification of the -libjars argument, which is the third classpath that must be specified. As described below, the -libjars argument must include not only kvclient.jar, but also a number of other third party libraries that may not be available in the remote Hadoop environment.

The Hadoop command interpreter's -libjars argument is used to specify the classpath needed by each MapReduce task executing on the Hadoop cluster's DataNodes. The -libjars argument must include all of the libraries needed to run the desired application that are not already available via the Hadoop platform. For the case above, kvclient.jar, sklogger.jar, commonutil.jar, failureaccess.jar, antlr4-runtime-nosql-shaded.jar, jackson-core.jar, jackson-databind.jar, and jackson-annotations.jar are each specified via the -libjars argument so that each MapReduce task can access classes such as TableInputSplit and TableRecordReader, as well as the logging related classes and JSON utility classes provided by Oracle NoSQL Database and other support classes that are not generally provided by the Hadoop platform.

Application Packaging and Execution for the Secure Case

Compare the non-secure case described in the previous section with what would be done to run the CountTableRows MapReduce job against a secure store. For the secure case, two JAR files are built; one for the classpath on the client side, and one for the classpaths of the DataNodes on the server side. The first JAR file will be added to the client side classpath and includes not only the class files for the application but also the public and private credentials the client will need to interact with the secure store. Including the public and private credentials in the client side JAR file avoids the inconvenience of having to specify that information on the command line.

The second JAR file will be added to the DataNode classpaths on the server side via the -libjars argument, and will include only the user's public credentials.

As described in the Deploying a Secure Store appendix, the user's password can be stored in either a clear text password file or an Oracle Wallet. As a result, how the first JAR is generated is dependent on whether a password file or an Oracle Wallet is used.

Application Packaging for the Secure Case Using a Password File

If you wish to execute CountTableRows using a password file instead of an Oracle Wallet, and if you have used KVSecurityCreation to generate the user's security artifacts in the manner presented in the Deploying a Secure Store appendix, then both the client side and server side JAR files for the CountTableRows example application are generated by typing the following on the command line:

cd /opt/oracle/nosql/apps/kv/examples
jar cvf CountTableRows-pwdClient.jar \
    hadoop/table/CountTableRows*.class \
    hadoop/table/KVSecurityUtil*.class

cd /tmp
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdClient.jar \
    client.trust
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdClient.jar \
    example-user-client-pwdfile.login
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdClient.jar \
    example-user.passwd

jar cvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdServer.jar \
    client.trust
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdServer.jar \
    example-user-server.login

The first four commands above produce the client side JAR file named CountTableRows-pwdClient.jar, where the contents of that JAR look like:

META-INF/
META-INF/MANIFEST.MF
hadoop/table/CountTableRows.class
hadoop/table/CountTableRows$Map.class
hadoop/table/CountTableRows$Reduce.class
hadoop/table/KVSecurityUtil.class
client.trust
example-user-client-pwdfile.login
example-user.passwd

The following files in the listing above correspond to security artifacts that should remain private to the client:

example-user-client-pwdfile.login
example-user.passwd

The last two commands above produce the server side JAR file named CountTableRows-pwdServer.jar, with contents that look like:

META-INF/
META-INF/MANIFEST.MF
client.trust
example-user-server.login

The last two files from the above list correspond to the client's security artifacts that can be shared publicly.

Application Execution for the Secure Case Using a Password File

If you wish to execute the CountTableRows MapReduce job against a secure store where a password file rather than an Oracle Wallet is used to store the client application's password, then after packaging the application for password file based execution as described in the previous section, you would then type the following on the command line:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:\
    /opt/oracle/kv-ee/kv/lib/kvclient.jar:\
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdServer.jar

cd /opt/oracle/nosql/apps/kv

hadoop jar examples/CountTableRows-pwdClient.jar \
    hadoop.table.CountTableRows \
    -libjars \
    /opt/oracle/kv-ee/kv/lib/kvclient.jar,\
    /opt/oracle/kv-ee/kv/lib/sklogger.jar,\
    /opt/oracle/kv-ee/kv/lib/commonutil.jar,\
    /opt/oracle/kv-ee/kv/lib/failureaccess.jar,\
    /opt/oracle/kv-ee/kv/lib/antlr4-runtime-nosql-shaded.jar,\
    /opt/oracle/kv-ee/kv/lib/jackson-core.jar,\
    /opt/oracle/kv-ee/kv/lib/jackson-databind.jar,\
    /opt/oracle/kv-ee/kv/lib/jackson-annotations.jar,\
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-pwdServer.jar \
    example-store \
    kv-host-1:5000 \
    vehicleTable \
    /user/example-user/CountTableRows/vehicleTable/0001 \
    example-user-client-pwdfile.login \
    example-user-server.login

Application Packaging for the Secure Case Using an Oracle Wallet

Rather than using a file in which to store the client's password, you may choose to use an Oracle Wallet to store the password in obfuscated form. If an Oracle Wallet will be used, and if you have used the KVSecurityCreation convenience program to generate the wallet based artifacts for CountTableRows in the manner presented in the Deploying a Secure Store appendix, then both the client side and server side JAR files for the wallet based CountTableRows example application are generated by typing the following on the command line:

cd /opt/oracle/nosql/apps/kv/examples
jar cvf CountTableRows-walletClient.jar \
    hadoop/table/CountTableRows*.class \
    hadoop/table/KVSecurityUtil*.class

cd /tmp
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletClient.jar \
    client.trust
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletClient.jar \
    example-user-client-wallet.login
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletClient.jar \
    example-user-wallet.dir

jar cvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletServer.jar \
    client.trust
jar uvf \
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletServer.jar \
    example-user-server.login

The first four commands above produce the client side JAR file named CountTableRows-walletClient.jar, where the contents of that JAR look like:

META-INF/
META-INF/MANIFEST.MF
hadoop/table/CountTableRows.class
hadoop/table/CountTableRows$Map.class
hadoop/table/CountTableRows$Reduce.class
hadoop/table/KVSecurityUtil.class
client.trust
example-user-client-wallet.login
example-user-wallet.dir/
example-user-wallet.dir/cwallet.sso

Similarly, the last two commands produce the server side JAR file named CountTableRows-walletServer.jar, with contents:

META-INF/
META-INF/MANIFEST.MF
client.trust
example-user-server.login

Application Execution for the Secure Case Using an Oracle Wallet

If you wish to execute the CountTableRows MapReduce job against a secure store using an Oracle Wallet to store the client application's password, then after packaging the application for wallet based execution as described in the previous section, you would type the following on the command line:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:\
    /opt/oracle/kv-ee/kv/lib/kvclient.jar:\
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletServer.jar

cd /opt/oracle/nosql/apps/kv

hadoop jar examples/CountTableRows-walletClient.jar \
    hadoop.table.CountTableRows \
    -libjars \
    /opt/oracle/kv-ee/kv/lib/kvclient.jar,\
    /opt/oracle/kv-ee/kv/lib/sklogger.jar,\
    /opt/oracle/kv-ee/kv/lib/commonutil.jar,\
    /opt/oracle/kv-ee/kv/lib/failureaccess.jar,\
    /opt/oracle/kv-ee/kv/lib/antlr4-runtime-nosql-shaded.jar,\
    /opt/oracle/kv-ee/kv/lib/jackson-core.jar,\
    /opt/oracle/kv-ee/kv/lib/jackson-databind.jar,\
    /opt/oracle/kv-ee/kv/lib/jackson-annotations.jar,\
    /opt/oracle/nosql/apps/kv/examples/CountTableRows-walletServer.jar \
    example-store \
    kv-host-1:5000 \
    vehicleTable \
    /user/example-user/CountTableRows/vehicleTable/0001 \
    example-user-client-wallet.login \
    example-user-server.login

Secure Versus Non-Secure Command Lines

When examining how the application is executed using either a wallet based or a password file based password storage mechanism, you should first notice that, unlike the non-secure case, the HADOOP_CLASSPATH environment variable and the -libjars argument have both been augmented with the JAR file that contains only the public credentials for login and trust; that is, either CountTableRows-pwdServer.jar or CountTableRows-walletServer.jar. Because those JAR files contain only public information, they can be safely transmitted to the server side remote address spaces.

Compare this with the value to which the application's local classpath is set, via the jar directive. Rather than including the application's server based JAR file, the local classpath instead is set to include the application's client based JAR file; either CountTableRows-pwdClient.jar or CountTableRows-walletClient.jar. The application's client based JAR file includes both the application's public and private credentials. Those JAR files contain security artifacts which should remain private to the application's address space; that is, the client side of the application. As a result, those JAR files must never be included in the HADOOP_CLASSPATH or -libjars specifications. They should be included only in the client's local classpath.

Finally, the only other difference between the command lines for secure execution and non-secure execution is the two additional arguments at the end of the argument list for the secure case; specifically, example-user-client-pwdfile.login (or example-user-client-wallet.login) and example-user-server.login.

The values of those arguments specify, respectively, the names of the client side and server side login files, whose contents will be retrieved as resources from the corresponding JAR file.

Observe that when you package and execute your MapReduce application in a manner like that shown here, there is no need to specify the username or password file (or wallet) on the command line; as that information is included as part of the client side JAR file. Additionally, the server side JAR file that is transferred from the Hadoop cluster’s access node to the job's DataNodes does not include that private information. This is important because that transferred JAR file will be cached in the file system of each of those DataNodes.

Summary

As the sections above demonstrate, the programming model for MapReduce and Oracle NoSQL Database Security supports (even encourages) the best practices presented in this section for building, packaging, and deploying any given MapReduce job that employs the Oracle NoSQL Database Table API to retrieve and process data in a given Oracle NoSQL Database store, either secure or non-secure. As a result, simply generating separate JAR files (a set of JAR files for the secure case, and a single JAR file for the non-secure case) allows deployers to conveniently run the job with or without security.

Note:

This model for separating public and private user credentials will also play an important role when executing Hive queries against table data in a secure store.