| Class | Description |
|---|---|
| CountTableRows | A basic example demonstrating how to use the class oracle.kv.hadoop.table.TableInputFormat to access the rows of a table in an Oracle NoSQL Database from within a Hadoop MapReduce job, for the purpose of counting the number of records in the table. |
| CountTableRows.Map | Nested class used to map input key/value pairs to a set of intermediate key/value pairs in preparation for the reduce phase of the associated MapReduce job. |
| CountTableRows.Reduce | Nested class used to reduce the set of intermediate values sharing a key, produced by the mapping phase of the job, to a smaller set of values. |
| KVSecurityCreation | Standalone program that creates (or deletes) a password file or Oracle Wallet for use when running the associated example MapReduce job with a secure KVStore. |
| KVSecurityUtil | Utility class that provides convenience methods related to the Oracle NoSQL Database security model. |
| LoadVehicleTable | Class that creates an example table in a given NoSQL Database store and then uses the Table API to populate the table with sample records. |
To make the most of this example, you should be familiar with Apache Hadoop, MapReduce, and its programming model; that is, become familiar with how to write and deploy a MapReduce job.

The example also assumes that Oracle NoSQL Database is installed under the directory /opt/ondb/kv — the <KVHOME> — on a system that is network reachable from the nodes of the Hadoop cluster; and that a store named example-store is deployed under the directory /opt/ondb/example-store — the <KVROOT> — to 3 machines (real or virtual) with host names kv-host-1, kv-host-2, and kv-host-3; where an admin service, listening on port 5000, is deployed on each host.

Using the <KVHOME> and <KVROOT> locations, the store name, host names, and admin port described above should allow you to more easily follow the example that is presented. Combined with the information contained in the Oracle NoSQL Database Getting Started Guide, as well as the
Oracle NoSQL Database Admin Guide and Oracle NoSQL Database Security Guide,
you should then be able to generalize and extend the example to your own
particular development scenario; substituting the values specific to the
given environment where necessary.
Detailed instructions for deploying a non-secure KVStore are provided in
Appendix A.
Similarly, Appendix B
provides instructions for deploying a KVStore configured for security.
Hadoop provides distributed storage via the Hadoop Distributed File System
(referred to as HDFS), and distributed processing via the MapReduce
programming model; which consists of a Map Phase
that includes a mapping step and a shuffle-and-sort step
that together perform filtering and sorting, and a Reduce Phase
that performs a summary operation on the mapped and sorted results.
Hadoop distributions (such as those provided by Cloudera or Hortonworks)
each provide an infrastructure which orchestrates the MapReduce
processing that is performed. It does this by marshalling the distributed
servers that run the various tasks in parallel, by managing all communications
and data transfers between the various parts of the system, and by providing for
redundancy and fault tolerance.
In addition, the Hadoop infrastructure provides a number of interactive tools — such as a command line interface (the Hadoop CLI) — that provide access to the data stored in HDFS. But the typical way application developers read, write, and process data stored in HDFS is via MapReduce jobs; which are programs that adhere to the Hadoop MapReduce programming model. For more detailed information on Hadoop HDFS and MapReduce, consult the Hadoop MapReduce tutorial.
As indicated above, with the introduction of the Table API, a new set of
interfaces and classes that satisfy the Hadoop MapReduce programming model
has been provided to support writing MapReduce jobs that can be run against
table data contained in a KVStore. These new classes are located in the
oracle.kv.hadoop.table
package, and consist of the following types:
- A subclass of org.apache.hadoop.mapreduce.InputFormat, which specifies how the associated MapReduce job reads its input data (using a Hadoop RecordReader), and splits up the input data into logical sections, each referred to as an InputSplit.
- A subclass of org.apache.hadoop.mapreduce.OutputFormat, which specifies how the associated MapReduce job writes its output data (using a Hadoop RecordWriter).
- A subclass of org.apache.hadoop.mapreduce.RecordReader, which specifies how the mapped keys and values are located and retrieved during MapReduce processing.
- A subclass of org.apache.hadoop.mapreduce.InputSplit, which represents the data to be processed by an individual MapReduce Mapper (one Mapper per InputSplit).
It is through the InputFormat
class provided in the Oracle NoSQL Database distribution that the Hadoop MapReduce
infrastructure obtains access to a given KVStore and the desired table data
that the store contains. Specifically, the classes provided for this purpose are:

- oracle.kv.hadoop.table.TableInputFormat
- oracle.kv.hadoop.table.TableInputSplit
- oracle.kv.hadoop.table.TableRecordReader
Currently, Oracle NoSQL Database does not define a subclass of the Hadoop
OutputFormat
class. This means that it is not currently possible to
write data from a MapReduce job into a KVStore.
That is, from within a MapReduce job, you can only retrieve data from
a desired KVStore table and process that data.
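To make the relationship between these classes concrete, the following sketch shows the general shape of a row-counting job like CountTableRows: a driver that configures TableInputFormat as the job's input format, a Mapper keyed by the table's PrimaryKey and Row types, and a Reducer that sums the counts. This is a minimal illustration rather than the source of the CountTableRows example shipped in the distribution; in particular, the static TableInputFormat setters (setKVStoreName, setKVHelperHosts, setTableName) and the PrimaryKey/Row key and value types shown here are assumptions that should be verified against the oracle.kv.hadoop.table javadoc.

```java
package hadoop.table;

import java.io.IOException;

import oracle.kv.hadoop.table.TableInputFormat;
import oracle.kv.table.PrimaryKey;
import oracle.kv.table.Row;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Hypothetical row-counting job; not the CountTableRows source shipped with the product. */
public class RowCountSketch extends Configured implements Tool {

    /* Each split hands the Mapper one PrimaryKey/Row pair at a time, retrieved from the store. */
    public static class Map extends Mapper<PrimaryKey, Row, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(PrimaryKey key, Row row, Context context)
                throws IOException, InterruptedException {
            /* Emit a single shared key so the Reducer sees one group to sum. */
            context.write(new Text("rowCount"), ONE);
        }
    }

    /* Sums the 1s emitted by all Mappers into a single total. */
    public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        /* args: <storeName> <host:port> <tableName> <hdfsOutputDir> */
        final Job job = Job.getInstance(getConf(), "RowCountSketch");
        job.setJarByClass(RowCountSketch.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        /* Read the job's input from the KVStore table instead of HDFS. The static
         * setters below are assumptions about the TableInputFormat API; consult the
         * oracle.kv.hadoop.table javadoc for the exact method names. */
        job.setInputFormatClass(TableInputFormat.class);
        TableInputFormat.setKVStoreName(args[0]);
        TableInputFormat.setKVHelperHosts(new String[] { args[1] });
        TableInputFormat.setTableName(args[2]);

        /* The results are still written to HDFS in the usual way. */
        FileOutputFormat.setOutputPath(job, new Path(args[3]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new RowCountSketch(), args));
    }
}
```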
The hadoop.table example package is contained in the following
location within your Oracle NoSQL Database distribution:
/opt/ondb/kv/examples/
hadoop/table/
CountTableRows.java
LoadVehicleTable.java
KVSecurityCreation.java
KVSecurityUtil.java
In order to run the CountTableRows example MapReduce job, a
KVStore — either secure or non-secure — must first be deployed,
and a table must be created and populated with data. Thus, before attempting
to execute CountTableRows, either deploy a non-secure KVStore
using the steps outlined in
Appendix A,
or start a KVStore configured for security using the steps presented in
Appendix B.
Once a KVStore has been deployed as described in either
Appendix A,
or
Appendix B,
the standalone Java program LoadVehicleTable can be run
against either type of store to create a table with the name and schema
expected by CountTableRows, and populate it with rows of
data consistent with the table's schema. Once the table is created and
populated with example data, CountTableRows can then be
executed to run a MapReduce job that counts the number of rows of data
in the table.
In addition to the LoadVehicleTable program, the example package also contains the
classes KVSecurityCreation and KVSecurityUtil; which
are provided to support running CountTableRows against a secure
KVStore. The standalone Java program KVSecurityCreation is provided
as a convenience, and can be run to create (or delete) a password file or Oracle
Wallet — along with associated client side and server side login files —
that CountTableRows will need to interact with a secure store. And the
KVSecurityUtil class provides convenient utility methods that
CountTableRows uses to create and process the various security
artifacts it uses for secure access.
The next sections explain how to compile and execute LoadVehicleTable
to create and populate the required example table in the deployed store; how to
compile and execute KVSecurityCreation to create or delete any
security credentials that may be needed by CountTableRows; and
finally, how to compile, build (JAR), and then execute the CountTableRows
MapReduce job on the Hadoop cluster that was deployed for this example.
Before running the CountTableRows MapReduce job, a table named
vehicleTable — having the schema shown in the table below — must
be created in the KVStore that was deployed for this example; where the data types
specified in the schema are defined by the Oracle NoSQL Database Table API (see
oracle.kv.table.FieldDef.Type).
| Field Name | Field Type |
|---|---|
| type | STRING |
| make | STRING |
| model | STRING |
| class | STRING |
| color | STRING |
| price | DOUBLE |
| count | INTEGER |

Primary Key Field Names: type, make, model, class, color

Shard Key Field Names: type, make, model
Thus, vehicleTable consists of rows representing a particular vehicle a dealer might have in stock for purchase. Each such row contains fields specifying the "type" of vehicle (for example, car, truck, SUV, etc.), the "make" of the vehicle (Ford, GM, Chrysler, etc.), the "model" (Explorer, Camaro, Lebaron, etc.), the vehicle "class" (4WheelDrive, FrontWheelDrive, etc.), "color", "price", and finally the number of vehicles in stock (the "count").
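For reference, the schema above corresponds to a table DDL statement along the following lines, which a program can submit through the KVStore handle. This is a minimal sketch assuming a non-secure store and the KVStore.executeSync method; the exact statement used by LoadVehicleTable may differ.

```java
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;

public class CreateVehicleTableSketch {
    public static void main(String[] args) {
        /* Connect to the (non-secure) store deployed for this example. */
        final KVStore store = KVStoreFactory.getStore(
            new KVStoreConfig("example-store", "kv-host-1:5000"));

        /* DDL matching the schema above: the primary key is
         * (type, make, model, class, color), sharded on (type, make, model). */
        store.executeSync(
            "CREATE TABLE IF NOT EXISTS vehicleTable (" +
            "  type STRING, make STRING, model STRING, class STRING," +
            "  color STRING, price DOUBLE, count INTEGER," +
            "  PRIMARY KEY (SHARD(type, make, model), class, color))");

        store.close();
    }
}
```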
Although you can enter individual commands in the admin CLI to create a table with the above
schema, the preferred approach is to employ the
Oracle NoSQL Database Data Definition Language (DDL) to create the desired table. One way to
accomplish this is to follow the instructions presented in the next sections to compile
and execute the LoadVehicleTable program; which populates the desired
table after using the DDL to create it.
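Populating the table through the Table API then follows the pattern sketched below; the single hard-coded row is a placeholder, and LoadVehicleTable's actual row-generation logic (driven by the -nops parameter) is more elaborate.

```java
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.table.Row;
import oracle.kv.table.Table;
import oracle.kv.table.TableAPI;

public class PopulateVehicleTableSketch {
    public static void main(String[] args) {
        final KVStore store = KVStoreFactory.getStore(
            new KVStoreConfig("example-store", "kv-host-1:5000"));
        final TableAPI tableAPI = store.getTableAPI();
        final Table table = tableAPI.getTable("vehicleTable");

        /* Build and write one row; a real loader would loop, varying the values. */
        final Row row = table.createRow();
        row.put("type", "suv");
        row.put("make", "Ford");
        row.put("model", "Explorer");
        row.put("class", "4WheelDrive");
        row.put("color", "red");
        row.put("price", 41471.98);
        row.put("count", 5);
        tableAPI.put(row, null, null); /* default ReturnRow and WriteOptions */

        store.close();
    }
}
```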
Once a KVStore — either non-secure or secure — has been deployed,
the LoadVehicleTable program that is supplied with the example as
a convenience can be executed to create and populate the table named vehicleTable.
Before executing LoadVehicleTable though, that program must first
be compiled. To do this, type the following command from the OS command line:
> cd /opt/ondb/kv
> javac -classpath lib/kvstore.jar:examples examples/hadoop/table/LoadVehicleTable.java

which should produce the file:
/opt/ondb/kv/examples/hadoop/table/LoadVehicleTable.class
— Creating and Populating 'vehicleTable' with Example Data in a Non-Secure KVStore —
To execute LoadVehicleTable to create and then populate the
table named vehicleTable with example data in a KVStore configured
for non-secure access, type the following at the command line of a
node that has network connectivity with a node running the admin service
(for example, kv-host-1 itself):
> cd /opt/ondb/kv
> java -classpath lib/kvstore.jar:examples hadoop.table.LoadVehicleTable \
-store example-store -host kv-host-1 -port 5000 -nops 79 [-delete]
where the parameters -store, -host, -port,
and -nops are required.
In the example command line above, the argument -nops 79 specifies
that 79 rows be written to the vehicleTable. If more or fewer rows
are desired, then the value of the -nops parameter should be changed
accordingly.
If LoadVehicleTable is executed a second time and the
optional -delete parameter is specified, then all rows added by
any previous executions of LoadVehicleTable are deleted from the
table prior to adding the new rows. Otherwise, all pre-existing rows are left
in place, and the number of rows in the table will be increased by the specified
-nops number of new rows.
— Creating and Populating 'vehicleTable' with Example Data in a Secure KVStore —
To execute LoadVehicleTable against a secure KVStore deployed
and provisioned with a non-administrative user employing the steps presented in
Appendix B,
an additional parameter must be added to the command line above. That is, type the following:
> cd /opt/ondb/kv
> javac -classpath lib/kvstore.jar:examples examples/hadoop/table/LoadVehicleTable.java
> cp /opt/ondb/example-store/security/client.trust /tmp
> java -classpath lib/kvstore.jar:examples hadoop.table.LoadVehicleTable \
      -store example-store -host kv-host-1 -port 5000 -nops 79 \
      -security /tmp/example-user-client-pwdfile.login
      [-delete]
where the single additional -security parameter specifies the
location of the login properties file (associated with a password file
rather than an Oracle Wallet) for the given user (the alias); and
all other parameters are as described for the non-secure case.
To understand the -security parameter for this example,
recall from Appendix B
that a non-administrative user named example-user was created, and
password file based credential files (prefixed with that user name) were generated
for that user and placed under the /tmp system directory. That is,
the example login and password files generated in
Appendix B are:
/tmp
client.trust
example-user-client-pwdfile.login
example-user-server.login
example-user.passwd
Note that for this example, the user credential files must be co-located; where it
doesn't matter which directory they are located in, as long as they all reside in the
same directory accessible by the user. It is for this reason that the shared trust file
(client.trust) is copied into /tmp above. Co-locating
client.trust and example-user.passwd with the login file
(example-user-client-pwdfile.login) allows relative paths to be used for the
values of the oracle.kv.ssl.trustStore and oracle.kv.auth.pwdfile.file
system properties that are specified in the login file (or oracle.kv.auth.wallet.dir
if a wallet is used to store the user password). If those files are not co-located
with the login file, then absolute paths must be used for those properties.
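As an illustration, a client side login properties file like example-user-client-pwdfile.login would therefore reference its companion files by relative path; for example (the exact set of properties written by KVSecurityCreation may differ, and any transport/SSL properties copied from client.security are omitted here):

oracle.kv.auth.username=example-user
oracle.kv.auth.pwdfile.file=example-user.passwd
oracle.kv.ssl.trustStore=client.trust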
At this point, the vehicleTable created in the specified KVStore (non-secure or secure)
should be populated with the desired example data. And the CountTableRows example
MapReduce job can be run to count the number of rows in that table.
Before it can be executed, the CountTableRows program
must first be compiled and built for deployment to the Hadoop infrastructure.
In order to compile the CountTableRows program, a number of Hadoop
JAR files must be installed and available in the build environment for inclusion
in the program classpath. Those JAR files are: commons-logging, hadoop-common,
hadoop-mapreduce-client-core, and hadoop-annotations; each at the version matching
your Hadoop installation (the exact file names appear in the compile command below).
For example, suppose that the 2.3.0 version of Hadoop that is
delivered via the 5.1.0 package provided by
Cloudera
(cdh) is installed under the <HADOOPROOT>
base directory. And suppose that the classes from that version of Hadoop
use the 1.1.3 version of commons-logging. Then,
to compile the CountTableRows program, type the following
at the command line (with the <HADOOPROOT> token
replaced with the appropriate directory path for your system):
> cd /opt/ondb/kv
> javac -classpath <HADOOPROOT>/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar: \
<HADOOPROOT>/hadoop/share/hadoop/common/hadoop-common-2.3.0-cdh5.1.0.jar: \
<HADOOPROOT>/hadoop/share/hadoop/mapreduce2/hadoop-mapreduce-client-core-2.3.0-cdh5.1.0.jar: \
<HADOOPROOT>/hadoop/share/hadoop/common/lib/hadoop-annotations-2.3.0-cdh5.1.0.jar: \
lib/kvclient.jar:examples \
examples/hadoop/table/CountTableRows.java
which produces the following files:
/opt/ondb/kv/examples/hadoop/table/
CountTableRows.class
CountTableRows$Map.class
CountTableRows$Reduce.class
If your specific environment has a different, compatible Hadoop distribution installed,
then simply replace the versions referenced in the example command line above with the
specific versions that are installed.
If you will be running CountTableRows against a non-secure
KVStore, then this is all you need; and the resulting class files should be placed in a
JAR file so that the program can be deployed to the example Hadoop cluster. For example,
to create a JAR file containing the class files needed to run CountTableRows
against a non-secure KVStore like that deployed in
Appendix A,
do the following:
> cd /opt/ondb/kv/examples
> jar cvf CountTableRows.jar hadoop/table/CountTableRows*.class

which should produce the file CountTableRows.jar in the
/opt/ondb/kv/examples directory, with contents that look like:
0 Fri Feb 20 12:53:24 PST 2015 META-INF/
68 Fri Feb 20 12:53:24 PST 2015 META-INF/MANIFEST.MF
3842 Fri Feb 20 12:49:16 PST 2015 hadoop/table/CountTableRows.class
2623 Fri Feb 20 12:49:16 PST 2015 hadoop/table/CountTableRows$Map.class
3842 Fri Feb 20 12:49:16 PST 2015 hadoop/table/CountTableRows$Reduce.class
Note that when the command above is used to generate CountTableRows.jar,
the utility class KVSecurityUtil (see below) will not be included in the
resulting JAR file. Since CountTableRows does not use that utility class
in the non-secure case, including it in the JAR file is optional.
— Building the Example for a Secure Environment —
If you will be running CountTableRows against a secure
KVStore such as that deployed in
Appendix B,
then in addition to compiling CountTableRows as described above,
additional security related artifacts need to be generated and included in
the build; where the additional artifacts include not only compiled class files,
but security credential files as well.
To support the secure version of CountTableRows, the utility class
KVSecurityUtil and the standalone program KVSecurityCreation
should also be compiled. That is,
> cd /opt/ondb/kv
> javac -classpath lib/kvstore.jar:examples examples/hadoop/table/KVSecurityCreation.java
> javac -classpath lib/kvstore.jar:examples examples/hadoop/table/KVSecurityUtil.java
> javac -classpath <HADOOPROOT>/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar: \
<HADOOPROOT>/hadoop/share/hadoop/common/hadoop-common-2.3.0-cdh5.1.0.jar: \
<HADOOPROOT>/hadoop/share/hadoop/mapreduce2/hadoop-mapreduce-client-core-2.3.0-cdh5.1.0.jar: \
<HADOOPROOT>/hadoop/share/hadoop/common/lib/hadoop-annotations-2.3.0-cdh5.1.0.jar: \
lib/kvclient.jar:examples \
examples/hadoop/table/CountTableRows.java
which produces the files:
/opt/ondb/kv/examples/hadoop/table/
CountTableRows.class
CountTableRows$Map.class
CountTableRows$Reduce.class
KVSecurityUtil.class
KVSecurityCreation.class
Unlike the non-secure case, the build artifacts needed to deploy CountTableRows
in a secure environment include more than just a single JAR file containing the generated class
files. For the secure case, it will be important to package some artifacts for deployment to the
client side of the application that communicates with the KVStore; whereas other artifacts
will need to be packaged for deployment to the server side. Although there are different
ways to achieve this "separation of concerns" when deploying a given application,
Appendix C
presents one particular model you can use to package and deploy the artifacts of an application
(such as CountTableRows) that will interact with a secure KVStore. With this in mind,
the sections below related to executing CountTableRows against a secure KVStore
each assume that the application has been built and packaged according to the instructions presented in
Appendix C.
If you will be running CountTableRows against a non-secure KVStore
deployed in the manner described in Appendix A,
and have compiled and built CountTableRows in the manner presented
in the previous section,
then the MapReduce job initiated by CountTableRows can be deployed
and executed by typing the following at the command line of the Hadoop cluster's
access node (where line breaks are used only for readability):
> cd /opt/ondb/kv
> hadoop jar examples/CountTableRows.jar \
hadoop.table.CountTableRows \
-libjars /opt/ondb/kv/lib/kvclient.jar \
example-store \
kv-host-1:5000 \
vehicleTable \
/user/example-user/CountTableRows/vehicleTable/<000N>
where the hadoop command interpreter's -libjars argument
is used to include the third party library kvclient.jar in the classpath
of each MapReduce task executing on the cluster's DataNodes; so that those tasks can
access classes such as TableInputSplit and TableRecordReader.
Note that in the last argument, the example-user directory
component corresponds to a directory under the HDFS /user top-level
directory, and typically corresponds to the user who has initiated the MapReduce job.
This directory is usually created in HDFS by the Hadoop cluster administrator.
Additionally, the <000N> token in that argument represents a string
such as 0000, 0001, 0002, etc.
Although any string can be used for this token, using a different number for
"N" on each execution of the job makes it easier to keep track of results when
multiple executions of the job occur.
— Executing CountTableRows Against a Secure KVStore —
If you will be running CountTableRows against a secure
KVStore deployed in the manner presented in
Appendix B,
and if you have compiled, built, and packaged CountTableRows and all
the necessary artifacts in the manner described in
Appendix C,
then CountTableRows can be run against the secure KVStore by typing the
following at the command line of the Hadoop cluster's access node (where line breaks are
used only for readability):
> export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/ondb/kv/lib/kvclient.jar:/opt/ondb/kv/examples/CountTableRows-pwdServer.jar
> cd /opt/ondb/kv
> hadoop jar examples/CountTableRows-pwdClient.jar \
hadoop.table.CountTableRows \
-libjars /opt/ondb/kv/lib/kvclient.jar,/opt/ondb/kv/examples/CountTableRows-pwdServer.jar \
example-store \
kv-host-1:5000 \
vehicleTable \
/user/example-user/CountTableRows/vehicleTable/<000N> \
example-user-client-pwdfile.login \
example-user-server.login
where the mechanism used for storing the user password is a password file.
The client side artifacts in the command above are CountTableRows-pwdClient.jar and the
login file example-user-client-pwdfile.login; the server side artifacts are
CountTableRows-pwdServer.jar and the login file example-user-server.login.
Similarly, if the mechanism used for storing the user password is an Oracle Wallet (available only with the Oracle NoSQL Database Enterprise Edition), you would type the following at the access node's command line:
> export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/ondb/kv/lib/kvclient.jar:/opt/ondb/kv/examples/CountTableRows-walletServer.jar
> cd /opt/ondb/kv
> hadoop jar examples/CountTableRows-walletClient.jar \
hadoop.table.CountTableRows \
-libjars /opt/ondb/kv/lib/kvclient.jar,/opt/ondb/kv/examples/CountTableRows-walletServer.jar \
example-store \
kv-host-1:5000 \
vehicleTable \
/user/example-user/CountTableRows/vehicleTable/<000N> \
example-user-client-wallet.login \
example-user-server.login
In both cases above — password file and Oracle Wallet — notice the
additional JAR file (CountTableRows-pwdServer.jar or
CountTableRows-walletServer.jar) specified for both the
HADOOP_CLASSPATH environment variable and -libjars
parameter. For a detailed explanation of the use and purpose of that server side
JAR file, as well as a description of the client side JAR file and the two
additional arguments at the end of the command line, refer to
Appendix C;
specifically, the section on
packaging
for a secure KVStore.
Whether running against a secure or non-secure store, as the job runs, assuming no errors, the output from the job will look like the following:
...
2014-12-04 08:59:47,996 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1344)) - Running job: job_1409172332346_0024
2014-12-04 08:59:54,107 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1372)) - map 0% reduce 0%
2014-12-04 09:00:16,148 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1372)) - map 7% reduce 0%
2014-12-04 09:00:17,368 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1372)) - map 26% reduce 0%
2014-12-04 09:00:18,596 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1372)) - map 56% reduce 0%
2014-12-04 09:00:19,824 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1372)) - map 100% reduce 0%
2014-12-04 09:00:23,487 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1372)) - map 100% reduce 100%
2014-12-04 09:00:23,921 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - Job job_1409172332346_0024 completed successfully
2014-12-04 09:00:24,117 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1390)) - Counters: 49
File System Counters
FILE: Number of bytes read=2771
FILE: Number of bytes written=644463
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2660
HDFS: Number of bytes written=32
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=6
Launched reduce tasks=1
Rack-local map tasks=6
Total time spent by all maps in occupied slots (ms)=136868
Total time spent by all reduces in occupied slots (ms)=2103
Total time spent by all map tasks (ms)=136868
Total time spent by all reduce tasks (ms)=2103
Total vcore-seconds taken by all map tasks=136868
Total vcore-seconds taken by all reduce tasks=2103
Total megabyte-seconds taken by all map tasks=140152832
Total megabyte-seconds taken by all reduce tasks=2153472
Map-Reduce Framework
Map input records=79
Map output records=79
Map output bytes=2607
Map output materialized bytes=2801
Input split bytes=2660
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=2801
Reduce input records=79
Reduce output records=1
Spilled Records=158
Shuffled Maps =6
Failed Shuffles=0
Merged Map outputs=6
GC time elapsed (ms)=549
CPU time spent (ms)=9460
Physical memory (bytes) snapshot=1888358400
Virtual memory (bytes) snapshot=6424895488
Total committed heap usage (bytes)=1409286144
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=32
To see the results of the job
— to verify that the program actually counted the correct number of rows in the table —
use the Hadoop CLI to display the contents of the MapReduce results file located in HDFS.
To do this, type the following at the command line of the Hadoop cluster's access node:
> hadoop fs -cat /user/example-user/CountTableRows/vehicleTable/<000N>/part-r-00000

where example-user and the <000N> token should
be replaced with the values you used when the job was run; appropriate to your particular
system. Assuming the table was populated with 79 rows, if the job was successful, then
the output should look like the following:
/type/make/model/class/color 79

where /type/make/model/class/color are the names of the fields making up the
PrimaryKey
of the vehicleTable; and 79 is the number of rows in the table.
The non-secure deployment described in this appendix assumes that the Oracle NoSQL Database
software is installed under /opt/ondb/kv on each host, and that the storage directory
employed on each host is /disk1/shard for host kv-host-1, /disk2/shard for host kv-host-2,
and /disk3/shard for host kv-host-3.

To generate each Storage Node's boot configuration, first, from kv-host-1, type:
> java -jar /opt/ondb/kv/lib/kvstore.jar makebootconfig \
-root /opt/ondb/example-store \
-config config.xml \
-port 5000 \
-admin 5001 \
-host kv-host-1 \
-harange 5002,5005 \
-num_cpus 0 \
-memory_mb 0 \
-capacity 3 \
-storagedir /disk1/shard
Then from kv-host-2, type:
> java -jar /opt/ondb/kv/lib/kvstore.jar makebootconfig \
-root /opt/ondb/example-store \
-config config.xml \
-port 5000 \
-admin 5001 \
-host kv-host-2 \
-harange 5002,5005 \
-num_cpus 0 \
-memory_mb 0 \
-capacity 3 \
-storagedir /disk2/shard
And finally from kv-host-3, type:
> java -jar /opt/ondb/kv/lib/kvstore.jar makebootconfig \
-root /opt/ondb/example-store \
-config config.xml \
-port 5000 \
-admin 5001 \
-host kv-host-3 \
-harange 5002,5005 \
-num_cpus 0 \
-memory_mb 0 \
-capacity 3 \
-storagedir /disk3/shard
Next, from each of the three hosts, start the store's processes by typing:

> nohup java -jar /opt/ondb/kv/lib/kvstore.jar start -root /opt/ondb/example-store -config config.xml &

which will start both an SNA and an admin service on the associated host.
Once the store's processes are running, start the admin CLI by typing the following (from kv-host-1, for example):

> java -jar /opt/ondb/kv/lib/kvstore.jar runadmin -host kv-host-1 -port 5000
kv->

Next, deploy the store by entering the following commands — either in succession, from the CLI prompt; or from a script, using the CLI command 'load -file <flnm>'.

configure -name example-store
plan deploy-zone -name zn1 -rf 3 -wait
plan deploy-sn -znname zn1 -host kv-host-1 -port 5000 -wait
plan deploy-admin -sn 1 -port 5001 -wait
pool create -name snpool
pool join -name snpool -sn sn1
plan deploy-sn -znname zn1 -host kv-host-2 -port 5000 -wait
plan deploy-admin -sn 2 -port 5001 -wait
pool join -name snpool -sn sn2
plan deploy-sn -znname zn1 -host kv-host-3 -port 5000 -wait
plan deploy-admin -sn 3 -port 5001 -wait
pool join -name snpool -sn sn3
change-policy -params "loggingConfigProps=oracle.kv.level=INFO;"
topology create -name store-layout -pool snpool -partitions 300
plan deploy-topology -name store-layout -plan-name store-deploy-plan -wait
The secure deployment described in this appendix assumes the same installation directory
(/opt/ondb/kv) and storage directories (/disk1/shard for host kv-host-1, /disk2/shard for
host kv-host-2, and /disk3/shard for host kv-host-3) as Appendix A. It also assumes a
password of 123456 (must be at least 6 characters) and a non-administrative user named
example-user.

To configure the store for secure access, first, from kv-host-1, type:
> java -jar /opt/ondb/kv/lib/kvstore.jar makebootconfig \
-root /opt/ondb/example-store \
-config config.xml \
-port 5000 \
-admin 5001 \
-host kv-host-1 \
-harange 5002,5005 \
-num_cpus 0 \
-memory_mb 0 \
-capacity 3 \
-storagedir /disk1/shard \
-store-security configure \
-pwdmgr pwdfile
Enter a password for the Java KeyStore:123456<RETURN>
Re-enter the KeyStore password for verification:123456<RETURN>
Created files
/opt/ondb/example-store/security/store.trust
/opt/ondb/example-store/security/store.keys
/opt/ondb/example-store/security/store.passwd
/opt/ondb/example-store/security/client.trust
/opt/ondb/example-store/security/security.xml
/opt/ondb/example-store/security/client.security
Note that the value of the -store-security parameter for
kv-host-1 is configure.
After executing the command above, use a utility such as scp
to copy the resulting security directory to the other SN hosts; that is,
kv-host-2 and kv-host-3. For example,
> scp -r /opt/ondb/example-store/security example-user@kv-host-2:/opt/ondb/example-store
> scp -r /opt/ondb/example-store/security example-user@kv-host-3:/opt/ondb/example-store
store.trust 100% 508 0.5KB/s 00:00
store.keys 100% 1215 1.2KB/s 00:00
store.passwd 100% 39 0.0KB/s 00:00
client.trust 100% 508 0.5KB/s 00:00
security.xml 100% 2216 2.2KB/s 00:00
client.security 100% 255 0.3KB/s 00:00
After generating and distributing the security configuration files by
executing the commands shown above from the first SN host (kv-host-1),
login to each of the remaining SN hosts and enable security by
typing the commands shown below from the respective host's command line.
That is, from kv-host-2, type the following:
> java -jar /opt/ondb/kv/lib/kvstore.jar makebootconfig \
-root /opt/ondb/example-store \
-config config.xml \
-port 5000 \
-admin 5001 \
-host kv-host-2 \
-harange 5002,5005 \
-num_cpus 0 \
-memory_mb 0 \
-capacity 3 \
-storagedir /disk2/shard \
-store-security enable \
-pwdmgr pwdfile
And then from kv-host-3, type:
> java -jar /opt/ondb/kv/lib/kvstore.jar makebootconfig \
-root /opt/ondb/example-store \
-config config.xml \
-port 5000 \
-admin 5001 \
-host kv-host-3 \
-harange 5002,5005 \
-num_cpus 0 \
-memory_mb 0 \
-capacity 3 \
-storagedir /disk3/shard \
-store-security enable \
-pwdmgr pwdfile
For both commands above, note the value of the -store-security parameter
is enable rather than configure; as was used
with the first host.
Next, as in Appendix A, from each of the three hosts, start the store's processes by typing:

> nohup java -jar /opt/ondb/kv/lib/kvstore.jar start -root /opt/ondb/example-store -config config.xml &

which will start both an SNA and an admin service on the associated host.
> java -jar /opt/ondb/kv/lib/kvstore.jar runadmin \
-host kv-host-1 \
-port 5000 \
-security /opt/ondb/example-store/security/client.security
Logged in admin as anonymous
kv->
Next, deploy the store by entering the following commands — either in succession,
from the CLI prompt; or from a script, using the CLI command 'load -file <flnm>'.
configure -name example-store
plan deploy-zone -name zn1 -rf 3 -wait
plan deploy-sn -znname zn1 -host kv-host-1 -port 5000 -wait
plan deploy-admin -sn 1 -port 5001 -wait
pool create -name snpool
pool join -name snpool -sn sn1
plan deploy-sn -znname zn1 -host kv-host-2 -port 5000 -wait
plan deploy-admin -sn 2 -port 5001 -wait
pool join -name snpool -sn sn2
plan deploy-sn -znname zn1 -host kv-host-3 -port 5000 -wait
plan deploy-admin -sn 3 -port 5001 -wait
pool join -name snpool -sn sn3
change-policy -params "loggingConfigProps=oracle.kv.level=INFO;"
topology create -name store-layout -pool snpool -partitions 300
plan deploy-topology -name store-layout -plan-name store-deploy-plan -wait

After configuring the store and deploying the topology as shown above, create a user named
root, having administrative
privileges. To do this, type the following command at the CLI prompt:
plan create-user -name root -admin -wait
Enter the new password: 123456<RETURN>
Re-enter the new password: 123456<RETURN>
In addition to creating the root user, a non-administrative user should also be created, along with the necessary credentials. This second user is created and provisioned to demonstrate how an application can be run against a table using only the minimum required privileges; as opposed to full, "root" privileges.
— Generate "Root" User and Credentials —
To generate the root user along with the necessary security credentials, login to one of the hosts running the admin service (kv-host-1 for example) and type the following commands at the command line:
> java -jar /opt/ondb/kv/lib/kvstore.jar securityconfig pwdfile create \
-file /opt/ondb/example-store/security/root.passwd
Created
> java -jar /opt/ondb/kv/lib/kvstore.jar securityconfig pwdfile secret \
-file /opt/ondb/example-store/security/root.passwd -set -alias root
Enter the secret value to store: 123456<RETURN>
Re-enter the secret value for verification: 123456<RETURN>
Secret created
OK
> cp /opt/ondb/example-store/security/client.security /opt/ondb/example-store/security/root.login
> echo oracle.kv.auth.username=root >> /opt/ondb/example-store/security/root.login
> echo oracle.kv.auth.pwdfile.file=/opt/ondb/example-store/security/root.passwd >> /opt/ondb/example-store/security/root.login
Note that the contents of the client.security properties file are copied
to the file named root.login. The contents of that file are used when a
client that wishes to connect to the secure KVStore started above must authenticate as
the user named root. For the purposes of this document, this authentication
process will be referred to as logging in to the KVStore; and thus, the properties
file is referred to as a login file (or login properties file). For
convenience, the system properties oracle.kv.auth.username and
oracle.kv.auth.pwdfile.file are inserted into root.login;
which allows one to connect to the store as the root user without having to
specify those properties on the command line.
— Generate Non-Administrative User with Privileges —
To create the non-administrative user, along with the necessary credentials, first login to the admin CLI by typing the following at the command line of a node that has network connectivity with the admin service:
> java -jar /opt/ondb/kv/lib/kvstore.jar runadmin \
-host kv-host-1 \
-port 5000 \
-security /opt/ondb/example-store/security/root.login
Logged in admin as root
kv->
Next, create a custom role (named readwritemodifytables
for example) consisting of the privileges a user would need to create and populate
a table. After creating the necessary role, create a user named example-user
and then grant the new role to that user. To do this, enter the following commands
— either in succession, from the CLI prompt; or from a script, using the CLI
command 'load -file <flnm>'.
execute 'CREATE ROLE readwritemodifytables'
execute 'GRANT SYSDBA TO readwritemodifytables'
execute 'GRANT READ_ANY TO readwritemodifytables'
execute 'GRANT WRITE_ANY TO readwritemodifytables'
execute 'CREATE USER example-user IDENTIFIED BY \"123456\"'
execute 'GRANT readwritemodifytables TO USER example-user'

Note that the name of the user created above is not required to be the same as the OS user name under which the example is executed. The name above and its associated credentials are registered with the KVStore for the purposes of authenticating to the store, and so can be any value you wish to use.
— Generate Non-Administrative User Credentials —
Once the KVStore user example-user and its password have been created,
the KVSecurityCreation convenience program can be used to generate
the public and private credentials needed by that user to connect to the KVStore.
To do this, login to one of the hosts running the admin service
(kv-host-1 for example) and type the following at the command line:
> cd /opt/ondb/kv
> javac -classpath lib/kvstore.jar:examples examples/hadoop/table/KVSecurityCreation.java

which produces the following files:
/opt/ondb/kv/examples/hadoop/table/
KVSecurityUtil.class
KVSecurityCreation.class
Once KVSecurityCreation has been compiled, it can be executed to
generate the desired credential artifacts. If you want to store the password in
a clear text password file, then type the following at the command line:
> cd /opt/ondb/kv
> java -classpath lib/kvstore.jar:examples hadoop.table.KVSecurityCreation -pwdfile example-user.passwd -set -alias example-user
May 04, 2015 11:23:32 AM hadoop.table.KVSecurityUtil removeDir
INFO: removed file [/tmp/example-user.passwd]
May 04, 2015 11:23:32 AM hadoop.table.KVSecurityUtil removeDir
INFO: removed file [/tmp/example-user-client-pwdfile.login]
created login properties file [/tmp/example-user-client-pwdfile.login]
created login properties file [/tmp/example-user-server.login]
created credentials store [/tmp/example-user.passwd]
Enter the secret value to store: 123456<RETURN>
Re-enter the secret value for verification: 123456<RETURN>
Secret created

On the other hand, if you are using an Oracle Wallet (Enterprise Edition only) to store the user's password, then type the following:
> cd /opt/ondb/kv
> java -classpath lib/kvstore.jar:examples hadoop.table.KVSecurityCreation -wallet example-user-wallet.dir -set -alias example-user
May 04, 2015 11:30:54 AM hadoop.table.KVSecurityUtil removeDir
INFO: removed file [/tmp/example-user-wallet.dir/cwallet.sso]
May 04, 2015 11:30:55 AM hadoop.table.KVSecurityUtil removeDir
INFO: removed directory [/tmp/example-user-wallet.dir]
May 04, 2015 11:30:55 AM hadoop.table.KVSecurityUtil removeDir
INFO: removed file [/tmp/example-user-client-wallet.login]
created login properties file [/tmp/example-user-client-wallet.login]
created login properties file [/tmp/example-user-server.login]
created credentials store [/tmp/example-user-wallet.dir]
Enter the secret value to store: 123456<RETURN>
Re-enter the secret value for verification: 123456<RETURN>
Secret created

Compare the artifacts generated when a password file is specified with the artifacts generated when a wallet is specified. When a password file is specified, you should see the following files:
/tmp
example-user-client-pwdfile.login
example-user-server.login
example-user.passwd
And when wallet storage is specified, you should see:
/tmp
example-user-client-wallet.login
example-user-server.login
/example-user-wallet.dir
cwallet.sso
Note that because this is an example for demonstration purposes, the credential
files generated by KVSecurityCreation are placed in the system's
/tmp directory. For your own applications, you may want to
place the credential files you generate in a more permanent location.
Note also that for both cases — password or wallet — two login properties
files are generated; one for client side connections, and one for server side connections.
The only difference between the client side login file and the server side login file is
that the client side login file specifies the username (the alias)
along with the location of the user's password
— specified by either the oracle.kv.auth.pwdfile or oracle.kv.auth.wallet.dir property.
Although optional, the reason for using two login files is to avoid passing private security
information to the server side; as explained in more detail in
Appendix C.
Additionally, observe that the server side login file (example-user-server.login)
is identical for both cases. This is because whether a password file or a wallet is used to
store the password, both use the same publicly visible communication transport information.
At this point, the KVStore has been deployed, configured for secure access, and provisioned with the necessary users and credentials; so that the table can be created and populated, and the example can be executed by a user whose password is stored either in a clear text password file or an Oracle Wallet (Enterprise Edition only) to demonstrate running against table data contained in a secure Oracle NoSQL Database store.
A final, important point to note is that the storage mechanism used for the example application's
user password (password file or Oracle Wallet) does not depend on the password
storage mechanism used by the KVStore with which that application will communicate. That is, although
Appendix B
(for convenience) deployed a secure KVStore using a password file rather than a wallet,
the fact that the KVStore placed the passwords it manages in a password file does not
prevent the developer/deployer of a client of that store from storing the client's
user password in an Oracle Wallet; or vice-versa. You should therefore view the use of
an Oracle Wallet or a password file by any client application as simply a "safe" place
(for some value of "safe") where the user password can be stored; which can be accessed
by only the user who owns the wallet or password file. This means that the choice of
password storage mechanism is at the discretion of the application developer/deployer;
no matter what mechanism is used by the KVStore itself.
With respect to running a MapReduce job against data contained in a secure KVStore,
a particularly important issue to address is related to the communication of user
credentials to the tasks run on each of the DataNodes on which the Hadoop
infrastructure executes the job. Recall from above that when using the
MapReduce
programming model defined by
Apache Hadoop
the tasks executed by a MapReduce job each act as a client of the KVStore. Thus,
if the store is configured for secure access, in order to retrieve the desired data
from the store, each task must have access to the credentials of the user associated
with that data. As described in the Oracle NoSQL Database Security Guide, the
typical mechanism for providing the necessary credentials to a client of a secure store
is to manually install the credentials on the client's local file system;
for example, by employing a utility such as scp. Although that mechanism
is practical for most clients of a secure KVStore, it is extremely impractical for a
MapReduce job. This is because a MapReduce job consists of multiple tasks running in
parallel, in separate address spaces, each with a separate file system that is generally
not under the control of the user. Assuming then, that write access is granted by the
Hadoop administrator (a problem in and of itself), this means that manual installation
of the client credentials for every possible user known to the KVStore would need to
occur on the file system of each of the multiple nodes in the Hadoop cluster; something
that may be very difficult to achieve.
To address this issue, the sections below present a model that developers and deployers can employ to facilitate the communication of each user's credentials to a given MapReduce job from the client side of the job; that is, from the address space controlled by the job's client process, owned by the user. As described below, this model will consist of two primary components: a programming model for executing MapReduce jobs that retrieve and process data contained in tables located in a secure KVStore; and a set of "best practices" for building, packaging, and deploying those jobs. Although there is nothing preventing a user from manually installing the necessary security credentials on all nodes in a given cluster, doing so is not only impractical, but may result in various security vulnerabilities. Combining this programming model with the deployment best practices that are presented will help developers and deployers not only avoid the need to manually pre-install credentials on the DataNodes of the Hadoop cluster, but will also prevent the sort of security vulnerabilities that can occur with manual installation.
A MapReduce job is typically submitted for execution to the Hadoop cluster's
ResourceManager. If the job will be run against a secure KVStore,
then prior to initiating the job, the client must initialize the job's
TableInputFormat
with three pieces of information: the
PasswordCredentials
containing the username and password the client will present to the store during authentication,
along with the names of the public login and trust files described below.
To perform this initialization, the client
— CountTableRows in this case — invokes the setKVSecurity
method defined in
TableInputFormat.
Once this initialization has been performed and the job has been initiated, the job uses that
TableInputFormat
to create and assign a
TableInputSplit
(a split) to each of the Mapper tasks that will run on one of the DataNodes
in the cluster. The
TableInputFormat
needs the information initialized by the setKVSecurity method both to connect to
the secure store itself and to initialize each split it creates with that same information.

In addition to initializing the
TableInputFormat
(and thus, its splits) with the information listed above, the model also requires that the
public and private security credentials referenced by that information be communicated to the
TableInputFormat,
as well as the splits, in a secure fashion. How this is achieved depends on whether that
information is being communicated to the
TableInputFormat
on the client side of the application, or to the splits on the server side.
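As a rough sketch of the client side initialization just described — where the setKVSecurity signature and parameter order shown are assumptions to be checked against the TableInputFormat javadoc — the driver of a job like CountTableRows might perform something like the following before submitting the job:

```java
import oracle.kv.PasswordCredentials;
import oracle.kv.hadoop.table.TableInputFormat;

public final class SecureJobInitSketch {

    /*
     * Client side initialization performed before the job is submitted.
     * The setKVSecurity signature used here is an assumption; consult the
     * oracle.kv.hadoop.table.TableInputFormat javadoc for the actual form.
     */
    public static void initSecurity(String username, char[] password) {
        /* Public artifacts: loaded by each split as classpath resources from
         * the server side JAR (for example, CountTableRows-pwdServer.jar). */
        final String serverLoginFile = "example-user-server.login";
        final String trustFile = "client.trust";

        /* Private credentials: created only on the client side and carried in
         * each split's internal state, never written to DataNode file systems. */
        final PasswordCredentials credentials =
            new PasswordCredentials(username, password);

        TableInputFormat.setKVSecurity(serverLoginFile, credentials, trustFile);
    }

    private SecureJobInitSketch() { }
}
```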
— Communicating Security Credentials to the Splits —
To facilitate communication of the user's security credentials to the splits distributed
to each of the DataNodes of the cluster, the model separates public security
information from the private information (the username and password), and then
stores the private information as part of each split's internal state, rather than on the
local file system of each associated DataNode; which may be vulnerable or difficult/impossible
to secure. For communication of the public contents of the login and trust files
to each such split, the model supports an (optional) mechanism that allows the application
to communicate that information as Java resources that each split retrieves from
the classpath of the split's Java VM. This avoids the need to manually transfer the contents
of those files to each DataNode's local file system, and also avoids the potential security
vulnerabilities that can result from manual installation on those nodes. Note that when
an application wishes to employ this mechanism, it will typically include the necessary
information in a JAR file that is specified to the MapReduce job via the -libjars
hadoop command line directive.
The intent of the mechanism just described is to allow applications to exploit the Hadoop infrastructure to automatically distribute the public login and trust information to each of the job's splits via a JAR file added to the classpath on each remote DataNode. But it is important to note that although this mechanism is used to distribute the application's public credentials, it must not be used to distribute any of the private information related to authentication; specifically, the username and password. This is important because a JAR file that is distributed to the DataNodes in the manner described may be cached on the associated DataNode's local file system; which might expose a vulnerability. As a result, private authentication information is only communicated as part of each split's internal state.
The separation of public and private credentials supported by this model not only prevents caching the private credentials on each DataNode, but also facilitates the ability to guarantee the confidentiality of that information; via whatever external third party secure communication mechanism the current Hadoop implementation happens to employ. This capability is also important to support the execution of Hive queries against a secure store.
— Communicating Security Credentials to the TableInputFormat —
With respect to the job's
TableInputFormat,
the programming model supports different options for communicating the user's security information.
This is because the
TableInputFormat
operates only on the access node, on the client side of the job; which means that there is
only one file system that needs to be secured. Additionally, unlike the splits, the
TableInputFormat
is not sent on the wire. Thus, as long as only the user is granted read privileges,
both the public and private security information can be installed on the access node's
file system without fear of compromise. For this case, the application would typically
use system properties (on the command line) to specify the fully-qualified paths to the
login, trust, and password files (or Oracle Wallet); which the
TableInputFormat
would then read from the local file system, retrieving the necessary public and private
security information.
A second option for communicating the user's security credentials to the
TableInputFormat
is to include the public and private information as resources in the client side
classpath of the Java VM in which the
TableInputFormat
runs. This is the option employed by the example presented in this document, and is
similar to what was described above for the splits. This option demonstrates how an
application's build model can be exploited to simplify not only the application's
command line, but also the deployment of secure MapReduce jobs in general. As was the
case with the splits, applications will typically communicate the necessary security
information as Java resources by including that information in a JAR file. But rather
than using the -libjars hadoop command line directive to specify the
JAR file to the server side of the MapReduce job, in this case, because the
TableInputFormat
operates on only the client side access node, the JAR file would simply be added to
the HADOOP_CLASSPATH environment variable.
Rather than manually installing the necessary security artifacts (login file, trust file, password file or Oracle Wallet) on each DataNode in the cluster, users should instead install those artifacts only on the cluster's single access node; the node from which the client application is executed. The client application can then retrieve each artifact from the local environment, repackage the necessary information, and then employ mechanisms provided by the Hadoop infrastructure to transfer that information to the appropriate components of the MapReduce job that will be executed.
For example, as described in the previous section, your client application can be designed
to retrieve the username and location of the password from the command line, a configuration
file, or a resource in the client classpath; where the location of the user's password is
a locally installed password file or Oracle Wallet (Enterprise Edition only) that can only
be read by the user. After retrieving the username from the command line and the password
from the specified location, the client uses that information to create the user's
PasswordCredentials,
which are transferred to each MapReduce task via the splits that are created by the job's
TableInputFormat.
Using this model, the user's
PasswordCredentials
are never written to the file systems of the cluster's DataNodes. They are only held in each
task's memory. As a result, the integrity and confidentiality of those credentials only
needs to be provided when on the wire; which can be achieved by using whatever external
third party secure communication mechanism the current Hadoop implementation happens
to employ.
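For instance, the client side of the job might recover the username and the password store location from the client login file on its classpath roughly as sketched below; the resource and property names are the ones discussed above, and the step that actually reads the secret out of the password file or wallet is deliberately omitted:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public final class ClientLoginSketch {

    /*
     * Reads the client side login properties (for example,
     * example-user-client-pwdfile.login) as a classpath resource and returns
     * the username it names. Reading the password itself from the password
     * file or wallet is omitted here.
     */
    public static String usernameFromLogin(String loginResource) throws IOException {
        final Properties props = new Properties();
        try (InputStream in = ClientLoginSketch.class.getClassLoader()
                 .getResourceAsStream(loginResource)) {
            if (in == null) {
                throw new IOException(loginResource + " not found on classpath");
            }
            props.load(in);
        }
        /* Property names taken from the login file discussion above. */
        final String username = props.getProperty("oracle.kv.auth.username");
        final String pwdStore = props.getProperty("oracle.kv.auth.pwdfile.file");
        System.out.println("user=" + username + ", password store=" + pwdStore);
        return username;
    }

    private ClientLoginSketch() { }
}
```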
With respect to the transfer of the public login and trust artifacts, the client application
can exploit the mechanisms provided by the Hadoop infrastructure to automatically transfer
classpath (JAR) artifacts to the job's tasks. As demonstrated by the CountTableRows
example presented in the body of this document, the client application's build process can be
designed to separate the application's class files from its public security artifacts.
Specifically, the application's class files (and optionally, the public and private credentials)
can be placed in a local (to the access node) JAR file for inclusion in the classpath of the
client itself; while only the public security artifacts (the public login properties and client
trust information) are placed in a separate JAR file that can be added to the
-libjars specification of the hadoop command line for inclusion in the classpath
of each MapReduce task.
— Review: Application Packaging for the Non-Secure Case —
To understand how the packaging model discussed here can be employed when executing an
application against a secure KVStore, it may be helpful to first review how the
CountTableRows example is executed against a non-secure store. Recall from
the previous sections, for the non-secure case, the following command was executed
to produce a JAR file containing only the class files needed by CountTableRows.
> cd /opt/ondb/kv/examples
> jar cvf CountTableRows.jar hadoop/table/CountTableRows*.class

which produces the file CountTableRows.jar, whose contents look like:
0 Fri Feb 20 12:53:24 PST 2015 META-INF/
68 Fri Feb 20 12:53:24 PST 2015 META-INF/MANIFEST.MF
3842 Fri Feb 20 12:49:16 PST 2015 hadoop/table/CountTableRows.class
2623 Fri Feb 20 12:49:16 PST 2015 hadoop/table/CountTableRows$Map.class
3842 Fri Feb 20 12:49:16 PST 2015 hadoop/table/CountTableRows$Reduce.class
Then the following commands can be used to execute the CountTableRows
example MapReduce job against a non-secure KVStore:
> export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/ondb/kv/lib/kvclient.jar
> cd /opt/ondb/kv
> hadoop jar examples/CountTableRows.jar hadoop.table.CountTableRows \
-libjars /opt/ondb/kv/lib/kvclient.jar \
example-store \
kv-host-1:5000 \
vehicleTable \
/user/example-user/CountTableRows/vehicleTable/0001
Note that there are three classpaths that must be set when a MapReduce job is
executed. First, the jar specification to the hadoop command interpreter
makes the class files of the main program (CountTableRows in this case)
accessible to the hadoop launcher mechanism; so that the program can be loaded and
executed. Next, the HADOOP_CLASSPATH environment variable must be
set to include any third party libraries that the program or the Hadoop framework
(running on the local access node) may need to load. For the example above,
kvclient.jar is added to HADOOP_CLASSPATH so
that the Hadoop framework's job initiation mechanism on the access node can access
TableInputFormat
and its related classes.
Finally, the hadoop command interpreter's -libjars argument is
used to include any third party libraries in the classpath of each MapReduce task
executing on the cluster's DataNodes. Again, for the case above, kvclient.jar
is specified in -libjars so that each MapReduce task can access
classes such as TableInputSplit and TableRecordReader.
— Application Packaging for the Secure Case —
Compare the non-secure case above with what would be done to run the
CountTableRows MapReduce job against a secure KVStore. For
the secure case, two JAR files are built; one for the classpath on
the client side, and one for the classpaths of the DataNodes on the server side. The
first JAR file will be added to the client side classpath and includes not only the
class files for the application but also the public and private credentials the
client will need to interact with the secure KVStore; where including the public and
private credentials in the client side JAR file avoids the inconvenience of having to
specify that information on the command line. The second JAR file will be added
(via the -libjars argument) to the DataNode classpaths on the server
side, and will include only the user's public credentials.
As described in
Appendix B,
the user's password can be stored in either a clear text password file or an Oracle
Wallet. As a result, how the first JAR is generated is dependent on whether a password file
is used or a wallet. For example, assuming that a password file is used and the
user's security artifacts are generated using the KVSecurityCreation
program in the manner presented in
Appendix B,
to generate both the client side and server side JAR files for the CountTableRows
example application, type the following:
> cd /opt/ondb/kv/examples
> jar cvf CountTableRows-pwdClient.jar hadoop/table/CountTableRows*.class hadoop/table/KVSecurityUtil*.class
> cd /opt/ondb/example-store/security
> jar uvf /opt/ondb/kv/examples/CountTableRows-pwdClient.jar client.trust
> cd /tmp
> jar uvf /opt/ondb/kv/examples/CountTableRows-pwdClient.jar example-user-client-pwdfile.login
> jar uvf /opt/ondb/kv/examples/CountTableRows-pwdClient.jar example-user.passwd
> cd /opt/ondb/example-store/security
> jar cvf /opt/ondb/kv/examples/CountTableRows-pwdServer.jar client.trust
> cd /tmp
> jar uvf /opt/ondb/kv/examples/CountTableRows-pwdServer.jar example-user-server.login

which produces the client side JAR file named CountTableRows-pwdClient.jar,
with contents that look like:
0 Mon May 04 13:01:04 PDT 2015 META-INF/
68 Mon May 04 13:01:04 PDT 2015 META-INF/MANIFEST.MF
3650 Mon May 04 13:00:52 PDT 2015 hadoop/table/CountTableRows.class
2623 Mon May 04 13:00:52 PDT 2015 hadoop/table/CountTableRows$Map.class
437 Mon May 04 13:00:52 PDT 2015 hadoop/table/CountTableRows$Reduce.class
6628 Mon May 04 13:00:52 PDT 2015 hadoop/table/KVSecurityUtil.class
508 Wed Apr 22 12:23:32 PDT 2015 client.trust
322 Mon May 04 11:23:32 PDT 2015 example-user-client-pwdfile.login
34 Mon May 04 11:23:38 PDT 2015 example-user.passwd
and produces the server side JAR file named CountTableRows-pwdServer.jar,
with contents that look like:
0 Mon May 04 13:01:04 PDT 2015 META-INF/
68 Mon May 04 13:01:04 PDT 2015 META-INF/MANIFEST.MF
508 Wed Apr 22 12:23:32 PDT 2015 client.trust
255 Mon May 04 11:30:54 PDT 2015 example-user-server.login
Alternatively, if KVSecurityCreation was used to generate wallet based
artifacts for CountTableRows, then the client side and server side
JAR files would be generated by typing:
> cd /opt/ondb/kv/examples
> jar cvf CountTableRows-walletClient.jar hadoop/table/CountTableRows*.class hadoop/table/KVSecurityUtil*.class
> cd /opt/ondb/example-store/security
> jar uvf /opt/ondb/kv/examples/CountTableRows-walletClient.jar client.trust
> cd /tmp
> jar uvf /opt/ondb/kv/examples/CountTableRows-walletClient.jar example-user-client-wallet.login
> jar uvf /opt/ondb/kv/examples/CountTableRows-walletClient.jar example-user-wallet.dir
> cd /opt/ondb/example-store/security
> jar cvf /opt/ondb/kv/examples/CountTableRows-walletServer.jar client.trust
> cd /tmp
> jar uvf /opt/ondb/kv/examples/CountTableRows-walletServer.jar example-user-server.login
each with contents identical or analogous to the contents of the JAR files for
the password case. That is,
0 Mon May 04 13:22:36 PDT 2015 META-INF/
68 Mon May 04 13:22:36 PDT 2015 META-INF/MANIFEST.MF
3650 Mon May 04 13:00:52 PDT 2015 hadoop/table/CountTableRows.class
2623 Mon May 04 13:00:52 PDT 2015 hadoop/table/CountTableRows$Map.class
437 Mon May 04 13:00:52 PDT 2015 hadoop/table/CountTableRows$Reduce.class
6628 Mon May 04 13:00:52 PDT 2015 hadoop/table/KVSecurityUtil.class
508 Wed Apr 22 12:23:32 PDT 2015 client.trust
324 Mon May 04 11:30:54 PDT 2015 example-user-client-wallet.login
0 Mon May 04 11:30:54 PDT 2015 example-user-wallet.dir/
3677 Mon May 04 11:31:00 PDT 2015 example-user-wallet.dir/cwallet.sso
and
0 Mon May 04 13:01:04 PDT 2015 META-INF/
68 Mon May 04 13:01:04 PDT 2015 META-INF/MANIFEST.MF
508 Wed Apr 22 12:23:32 PDT 2015 client.trust
255 Mon May 04 11:30:54 PDT 2015 example-user-server.login
Finally, in a fashion similar to that described for the non-secure case above, to
execute the CountTableRows MapReduce job — using a password file —
against a secure KVStore, you would type the following:
> export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/ondb/kv/lib/kvclient.jar:/opt/ondb/kv/examples/CountTableRows-pwdServer.jar
> cd /opt/ondb/kv
> hadoop jar examples/CountTableRows-pwdClient.jar \
hadoop.table.CountTableRows \
-libjars /opt/ondb/kv/lib/kvclient.jar,/opt/ondb/kv/examples/CountTableRows-pwdServer.jar \
example-store \
kv-host-1:5000 \
vehicleTable \
/user/example-user/CountTableRows/vehicleTable/0001 \
example-user-client-pwdfile.login \
example-user-server.login
Similarly, if the application stores its password in an Oracle Wallet, then you
would type:
> export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/ondb/kv/lib/kvclient.jar:/opt/ondb/kv/examples/CountTableRows-walletServer.jar
> cd /opt/ondb/kv
> hadoop jar examples/CountTableRows-walletClient.jar \
hadoop.table.CountTableRows \
-libjars /opt/ondb/kv/lib/kvclient.jar,/opt/ondb/kv/examples/CountTableRows-walletServer.jar \
example-store \
kv-host-1:5000 \
vehicleTable \
/user/example-user/CountTableRows/vehicleTable/0001 \
example-user-client-wallet.login \
example-user-server.login
When comparing the command lines above with the command line used for the non-secure
case, you should notice that HADOOP_CLASSPATH and -libjars
both have been augmented with the JAR file that contains only the public
login and trust credentials (CountTableRows-pwdServer.jar or
CountTableRows-walletServer.jar); whereas the local classpath
of the client side of the application is augmented — via the jar directive —
with the JAR file that includes both the public and private credentials (CountTableRows-pwdClient.jar
or CountTableRows-walletClient.jar). The only other difference with the
non-secure case is the two additional arguments at the end of the argument list;
example-user-client-pwdfile.login (or example-user-client-wallet.login)
and example-user-server.login. The values of those arguments specify, respectively,
the names of the client side and server side login files; which will be retrieved as
resources from the corresponding JAR file.
Observe that when you package and execute your MapReduce application in a manner like that shown in the example above, there is no need to specify the username or password file (or wallet) on the command line; as that information is included as part of the client side JAR file. Additionally, the server side JAR file that is transferred from the access node to the job's DataNodes does not include that private information; which is important because that transferred JAR file will be cached in the file system of each of those DataNodes.
As the example above demonstrates, the programming model
for MapReduce and Oracle NoSQL Database Security supports (even encourages) the
best practices presented in this section for building, packaging, and deploying
any given MapReduce job that employs the Oracle NoSQL Database Table API to retrieve
and process data in a given KVStore — either secure or non-secure. As a result,
simply generating separate JAR files — a set of JAR files for the secure case,
and one for the non-secure case — allows deployers to conveniently run the job
with or without security.
Note that this model for separating public and private user credentials will play an important role when executing Hive queries against table data in a secure KVStore.
Copyright (c) 2011, 2015 Oracle and/or its affiliates. All rights reserved.