org.apache.hadoop.mapreduce.InputFormat<K,V>

oracle.kv.hadoop.table.TableInputFormatBase<K,V>

Direct Known Subclasses: TableInputFormat

public abstract class TableInputFormatBase<K,V> extends InputFormat<K,V>

This is the base class for Oracle NoSQL Database InputFormat classes that can be used to run MapReduce against data stored via the Table API. Keys are of type PrimaryKey. Values are always of type Row.

Parameters may be passed using either the static setters on this class or through the Hadoop JobContext configuration parameters. The following parameters are recognized:

oracle.kv.kvstore - the KV Store name for this InputFormat to operate on. This is equivalent to the setKVStoreName(java.lang.String) method.
oracle.kv.hosts - one or more hostname:port pairs separated by commas naming hosts in the KV Store. This is equivalent to the setKVHelperHosts(java.lang.String[]) method.
oracle.kv.hadoop.hosts - one or more hostname strings separated by commas naming the Hadoop data node hosts in the Hadoop cluster that will support MapReduce jobs. This is equivalent to the setKVHadoopHosts(java.lang.String[]) method. The value(s) specified by this property will be returned by the getLocations method of TableInputSplit. If this property is not specified, or if the setKVHadoopHosts(java.lang.String[]) method is not called, then the values specified via the oracle.kv.hosts property (or the setKVHelperHosts(java.lang.String[]) method) will be used instead.
oracle.kv.tableName - the name of the table in the store from which data will be retrieved. This is equivalent to the setTableName(java.lang.String) method.
oracle.kv.primaryKey - Property whose value consists of the components to use when constructing the key to employ when iterating the table. The value must be a list of name:value pairs in JSON format, like the following: -Doracle.kv.primaryKey="{\"name\":\"stringVal\",\"name\":floatVal}" where the list itself is enclosed in unescaped double quotes and curly braces, and each field name component, as well as each STRING type field value component, is enclosed in ESCAPED double quotes.
In addition to the JSON format requirement above, the values referenced by the various fieldValue components of this Property must satisfy the semantics of PrimaryKey for the given table; that is, they must represent a first-to-last subset of the table's primary key fields, and they must be specified in the same order as those primary key fields. If the components of this property do not satisfy these requirements, a full primary key wildcard will be used when iterating the table.
This is equivalent to the setPrimaryKeyProperty(java.lang.String) method.
oracle.kv.fieldRange - Property whose value consists of the components to use when constructing the field range to employ when iterating the table. The value must be a list of name:value pairs in JSON format, like the following: -Doracle.kv.fieldRange="{\"name\":\"fieldName\", \"start\":\"startVal\",[\"startInclusive\":true|false], \"end\":\"endVal\",[\"endInclusive\":true|false]}" where, for the given field over which to range, the 'start' and 'end' components are required and the 'startInclusive' and 'endInclusive' components are optional, defaulting to 'true' if not included. Note that the list itself is enclosed in unescaped double quotes and curly braces, and each name component and string type value component is enclosed in ESCAPED double quotes.
In addition to the JSON format requirement above, the values referenced by the components of this Property's value must also satisfy the semantics of FieldRange; that is,
- the values associated with the target key must correspond to a valid primary key in the table
- the value associated with the fieldName must be the name of a valid field of the primary key over which iteration will be performed
- the values associated with the start and end of the range must correspond to valid values of the given fieldName
- the value associated with either of the inclusive components must be either 'true' or 'false'
If the components of this property do not satisfy these requirements, then table iteration will be performed over the full range of values of the PrimaryKey rather than a sub-range.
This is equivalent to the setFieldRangeProperty(java.lang.String) method.
oracle.kv.consistency - Specifies the read consistency associated with the lookup of the child KV pairs. Version- and Time-based consistency may not be used. If null, the default consistency is used.
This is equivalent to the setConsistency(oracle.kv.Consistency) method.
oracle.kv.timeout - Specifies an upper bound on the time interval for processing a particular KV retrieval. A best effort is made to not exceed the specified limit. If zero, the default request timeout is used. This value is always in milliseconds.
This is equivalent to the setTimeout(long) and setTimeoutUnit(java.util.concurrent.TimeUnit) methods.
oracle.kv.maxRequests - Specifies the maximum number of client side threads to use when running an iteration; where a value of 1 causes the iteration to be performed using only the current thread, and a value of 0 causes the client to base the number of threads to employ on the current store topology.
This is equivalent to the setMaxRequests(int) method.
oracle.kv.batchSize - Specifies the suggested number of keys to fetch during each network round trip by the InputFormat. If 0, an internally determined default is used. This is equivalent to the setBatchSize(int) method.
oracle.kv.maxBatches - Specifies the maximum number of result batches that can be held in memory on the client side before processing on the server side pauses. This parameter can be used to prevent the client side memory from being exceeded if the client cannot consume results as fast as they are generated by the server side.
This is equivalent to the setMaxBatches(int) method.
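
The property values described above can be assembled programmatically before they are handed to the job configuration. The following is a minimal sketch using only the JDK: the class KvJobPropertiesSketch, its methods, and every value in it (store name, hosts, table and field names) are hypothetical, and only the oracle.kv.* property names come from the list above. It does not escape quotes embedded in string values, and it produces the plain JSON text; the backslash escaping shown above is only needed when the value is passed on a command line with -D.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: only the oracle.kv.* property names come from the list above. */
class KvJobPropertiesSketch {

    /** Renders a JSON value: strings are quoted, numbers and booleans are written bare. */
    static String jsonValue(Object value) {
        return (value instanceof String) ? "\"" + value + "\"" : String.valueOf(value);
    }

    /** Builds a flat JSON object from name/value pairs, preserving insertion order. */
    static String toJson(Map<String, Object> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        fields.forEach((name, value) -> json.add("\"" + name + "\":" + jsonValue(value)));
        return json.toString();
    }

    /** Value for oracle.kv.fieldRange; both inclusive flags are written explicitly. */
    static String fieldRangeJson(String fieldName, Object start, boolean startInclusive,
                                 Object end, boolean endInclusive) {
        Map<String, Object> range = new LinkedHashMap<>();
        range.put("name", fieldName);
        range.put("start", start);
        range.put("startInclusive", startInclusive);
        range.put("end", end);
        range.put("endInclusive", endInclusive);
        return toJson(range);
    }

    /** Collects a sample set of job properties; all values are made up. */
    static Map<String, String> sampleProperties() {
        Map<String, Object> key = new LinkedHashMap<>();
        key.put("lastName", "Smith");                 // hypothetical first primary key field

        Map<String, String> props = new LinkedHashMap<>();
        props.put("oracle.kv.kvstore", "mystore");
        props.put("oracle.kv.hosts", String.join(",", "kvhost1:5000", "kvhost2:5000"));
        props.put("oracle.kv.tableName", "users");
        props.put("oracle.kv.primaryKey", toJson(key));
        props.put("oracle.kv.fieldRange", fieldRangeJson("firstName", "A", true, "M", false));
        props.put("oracle.kv.timeout", "30000");      // milliseconds
        return props;
    }

    public static void main(String[] args) {
        sampleProperties().forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Each entry of the resulting map could then be copied into the Hadoop job configuration, or the equivalent static setter on this class could be called instead.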

Internally, the TableInputFormatBase class utilizes the method oracle.kv.table.TableIterator<oracle.kv.table.Row> TableAPI.tableIterator to retrieve records. You should refer to the javadoc for that method for information about the various parameters.

TableInputFormatBase dynamically generates a number of splits, each encapsulating a list of sets whose elements are partition ids that can be retrieved in parallel, to optimize retrieval performance. The "size" of each generated split, which is the value returned by the getLength method of TableInputSplit, is the number of sets in that encapsulated list of partition id sets.

If the consistency passed to TableInputFormatBase is NONE_REQUIRED (the default), then InputSplit.getLocations() will return an array of the names of the master and the replica(s) which contain the partition. Alternatively, if the consistency is ABSOLUTE, then the array returned will contain only the name of the master. This means that if Hadoop job trackers are running on the nodes named in the returned location array, Hadoop will generally attempt to run the subtasks for a particular partition on those nodes where the data is stored and replicated. Hadoop and Oracle NoSQL DB administrators should be careful about co-location of Oracle NoSQL DB and Hadoop processes, since they may compete for resources.
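
The effect of the consistency setting on the reported locations can be illustrated with a small model. This is a sketch of the rule as stated above, not of the class's internals: SplitLocationsSketch, its ReadConsistency enum (a stand-in for the two oracle.kv.Consistency values named above), and the host names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: models the location rule described above, not the class's internals. */
class SplitLocationsSketch {

    /** Hypothetical stand-in for the two consistency values named above. */
    enum ReadConsistency { NONE_REQUIRED, ABSOLUTE }

    /** Returns the host names a split would advertise for one partition. */
    static String[] locations(String master, List<String> replicas, ReadConsistency consistency) {
        List<String> hosts = new ArrayList<>();
        hosts.add(master);                       // the master is listed in both cases
        if (consistency == ReadConsistency.NONE_REQUIRED) {
            hosts.addAll(replicas);              // replicas are listed only for NONE_REQUIRED
        }
        return hosts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("kvhost2", "kvhost3");
        // prints [kvhost1, kvhost2, kvhost3]
        System.out.println(Arrays.toString(
            locations("kvhost1", replicas, ReadConsistency.NONE_REQUIRED)));
        // prints [kvhost1]
        System.out.println(Arrays.toString(
            locations("kvhost1", replicas, ReadConsistency.ABSOLUTE)));
    }
}
```

With ABSOLUTE consistency the scheduler has a single preferred node per partition, so data-local task placement has fewer options than with NONE_REQUIRED.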

InputSplit.getLength() always returns 1.

A simple example demonstrating the Oracle NoSQL DB Hadoop oracle.kv.hadoop.table.TableInputFormat class reading data from Hadoop in a MapReduce job and counting the number of rows in a given table in the store can be found in the KVHOME/examples/hadoop/table directory. The javadoc for that program describes the simple MapReduce processing as well as how to invoke the program in Hadoop.

Since: 3.1

Nested Class Summary

static class TableInputFormatBase.TopologyLocatorWrapper - Special class that wraps the static get method of the TopologyLocator utility.

Field Summary

protected TableInputFormatBase.TopologyLocatorWrapper topologyLocator
protected static final String USER_SECURITY_DIR

Method Summary

static void clearLocalKVSecurity() - Convenience method for testing.
static void setBatchSize(int batchSize) - Specifies the suggested number of keys to fetch during each network round trip by the InputFormat.
static void setConsistency(Consistency consistency) - Specifies the read consistency associated with the lookup of the child KV pairs.
static void setDirection(Direction newDirection) - Specifies the order in which records are returned by the InputFormat.
static void setFieldRangeProperty(String newProperty) - Sets the String to use for the property value whose contents are used to construct the field range to employ when iterating the table.
static void setKVHadoopHosts(String[] newHadoopHosts) - Set the KV Hadoop data node host name(s) for this InputFormat to operate on.
static void setKVHelperHosts(String[] newHelperHosts) - Set the KV Helper host:port pair(s) for this InputFormat to operate on.
static void setKVSecurity(String loginFile, PasswordCredentials userPasswordCredentials, String trustFile) - Sets the login properties file and the public trust file (keys and/or certificates), as well as the PasswordCredentials for authentication.
static void setKVStoreName(String newStoreName) - Set the KV Store name for this InputFormat to operate on.
static void setMaxBatches(int newMaxBatches) - Specifies the maximum number of result batches that can be held in memory on the client side before processing on the server side pauses.
static void setMaxRequests(int newMaxRequests) - Specifies the maximum number of client side threads to use when running an iteration; where a value of 1 causes the iteration to be performed using only the current thread, and a value of 0 causes the client to base the number of threads to employ on the current store topology.
static void setPrimaryKeyProperty(String newProperty) - Sets the String to use for the property value whose contents are used to construct the primary key to employ when iterating the table.
void setQueryInfo(int newQueryBy, String newWhereClause, Integer newPartitionId)
static void setTableName(String newTableName) - Set the name of the table in the KV store that this InputFormat will operate on.
static void setTimeout(long timeout) - Specifies an upper bound on the time interval for processing a particular KV retrieval.
static void setTimeoutUnit(TimeUnit timeoutUnit) - Specifies the unit of the timeout parameter.

Methods inherited from class org.apache.hadoop.mapreduce.InputFormat
createRecordReader

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- USER_SECURITY_DIR
  
  protected static final String USER_SECURITY_DIR
- topologyLocator
  
  protected TableInputFormatBase.TopologyLocatorWrapper topologyLocator
Method Details
- setKVStoreName
  
  public static void setKVStoreName(String newStoreName)
  
  Set the KV Store name for this InputFormat to operate on. This is equivalent to passing the oracle.kv.kvstore Hadoop property.
  
  Parameters:
  
  newStoreName - the new KV Store name to set
- setKVHelperHosts
  
  public static void setKVHelperHosts(String[] newHelperHosts)
  
  Set the KV Helper host:port pair(s) for this InputFormat to operate on. This is equivalent to passing the oracle.kv.hosts Hadoop property.
  
  Parameters:
  
  newHelperHosts - array of hostname:port strings of any hosts in the KV Store.
- setKVHadoopHosts
  
  public static void setKVHadoopHosts(String[] newHadoopHosts)
  
  Set the KV Hadoop data node host name(s) for this InputFormat to operate on. This is equivalent to passing the oracle.kv.hadoop.hosts property.
  
  Parameters:
  
  newHadoopHosts - array of hostname strings corresponding to the names of the Hadoop data node hosts in the Hadoop cluster that this InputFormat will use to support MapReduce jobs.
- setTableName
  
  public static void setTableName(String newTableName)
  
  Set the name of the table in the KV store that this InputFormat will operate on. This is equivalent to passing the oracle.kv.tableName property.
  
  Parameters:
  
  newTableName - the new table name to set.
- setPrimaryKeyProperty
  
  public static void setPrimaryKeyProperty(String newProperty)
  
  Sets the String to use for the property value whose contents are used to construct the primary key to employ when iterating the table. The format of the String input to this method must be a comma-separated String of the form: fieldName,fieldValue,fieldType,fieldName,fieldValue,fieldType,.. where the number of elements separated by commas must be a multiple of 3, and each fieldType must be 'STRING', 'INTEGER', 'LONG', 'FLOAT', 'DOUBLE', or 'BOOLEAN'. Additionally, the values referenced by the various fieldType and fieldValue components of this String must satisfy the semantics of PrimaryKey for the given table; that is, they must represent a first-to-last subset of the table's primary key fields, and they must be specified in the same order as those primary key fields. If the String referenced by this property does not satisfy these requirements, a full primary key wildcard will be used when iterating the table.
  This is equivalent to passing the oracle.kv.primaryKey Hadoop property.
  
  Parameters:
  
  newProperty - the new primary key property to set
- setFieldRangeProperty
  
  public static void setFieldRangeProperty(String newProperty)
  
  Sets the String to use for the property value whose contents are used to construct the field range to employ when iterating the table. The value must be a list of name:value pairs in JSON format, like the following: -Doracle.kv.fieldRange="{\"name\":\"fieldName\", \"start\":\"startVal\",[\"startInclusive\":true|false], \"end\":\"endVal\",[\"endInclusive\":true|false]}" where, for the given field over which to range, the 'start' and 'end' components are required and the 'startInclusive' and 'endInclusive' components are optional, defaulting to 'true' if not included. Note that the list itself is enclosed in unescaped double quotes and curly braces, and each name component and string type value component is enclosed in ESCAPED double quotes.
  In addition to the JSON format requirement above, the values referenced by the components of this Property's value must also satisfy the semantics of FieldRange; that is,
  
  - the values associated with the target key must correspond to a valid primary key in the table
  - the value associated with the fieldName must be the name of a valid field of the primary key over which iteration will be performed
  - the values associated with the start and end of the range must correspond to valid values of the given fieldName
  - the value associated with either of the inclusive components must be either 'true' or 'false'
  
  If the components of this property do not satisfy these requirements, then table iteration will be performed over the full range of values of the PrimaryKey rather than a sub-range.
  This is equivalent to passing the oracle.kv.fieldRange Hadoop property.
  
  Parameters:
  
  newProperty - the new field range property to set
- setDirection
  
  public static void setDirection(Direction newDirection)
  
  Specifies the order in which records are returned by the InputFormat. Note that when doing PrimaryKey iteration, only Direction.UNORDERED is allowed.
  
  Parameters:
  
  newDirection - the direction to retrieve data
- setConsistency
  
  public static void setConsistency(Consistency consistency)
  
  Specifies the read consistency associated with the lookup of the child KV pairs. Version- and Time-based consistency may not be used. If null, the default consistency is used. This is equivalent to passing the oracle.kv.consistency Hadoop property.
  
  Parameters:
  
  consistency - the consistency
- setTimeout
  
  public static void setTimeout(long timeout)
  
  Specifies an upper bound on the time interval for processing a particular KV retrieval. A best effort is made to not exceed the specified limit. If zero, the default request timeout is used. This is equivalent to passing the oracle.kv.timeout Hadoop property.
  
  Parameters:
  
  timeout - the timeout
- setTimeoutUnit
  
  public static void setTimeoutUnit(TimeUnit timeoutUnit)
  
  Specifies the unit of the timeout parameter. The unit may be null only if the timeout is currently zero; otherwise an IllegalArgumentException is thrown. This is equivalent to passing the oracle.kv.timeout.unit Hadoop property.
  
  Parameters:
  
  timeoutUnit - the timeout unit
- setMaxRequests
  
  public static void setMaxRequests(int newMaxRequests)
  
  Specifies the maximum number of client side threads to use when running an iteration; where a value of 1 causes the iteration to be performed using only the current thread, and a value of 0 causes the client to base the number of threads to employ on the current store topology.
  This is equivalent to passing the oracle.kv.maxRequests Hadoop property.
  
  Parameters:
  
  newMaxRequests - the suggested number of threads to employ when performing an iteration.
- setBatchSize
  
  public static void setBatchSize(int batchSize)
  
  Specifies the suggested number of keys to fetch during each network round trip by the InputFormat. If 0, an internally determined default is used. This is equivalent to passing the oracle.kv.batchSize Hadoop property.
  
  Parameters:
  
  batchSize - the suggested number of keys to fetch during each network round trip.
- setMaxBatches
  
  public static void setMaxBatches(int newMaxBatches)
  
  Specifies the maximum number of result batches that can be held in memory on the client side before processing on the server side pauses. This parameter can be used to prevent the client side memory from being exceeded if the client cannot consume results as fast as they are generated by the server side.
  This is equivalent to passing the oracle.kv.maxBatches Hadoop property.
  
  Parameters:
  
  newMaxBatches - the maximum number of result batches that can be held in memory on the client side.
- setKVSecurity
  
  public static void setKVSecurity(String loginFile, PasswordCredentials userPasswordCredentials, String trustFile) throws IOException
  
  Sets the login properties file and the public trust file (keys and/or certificates), as well as the PasswordCredentials for authentication. The value of the loginFile and trustFile parameters must be either a fully qualified path referencing a file located on the local file system, or the name of a file (no path) whose contents can be retrieved as a resource from the current VM's classpath.
  Note that this class provides the getSplits method, which must be able to contact a secure store and so needs access to local copies of the login properties and trust files. As a result, if the values input for the loginFile and trustFile parameters are simple file names rather than fully qualified paths, this method retrieves the contents of each from the classpath and generates private, local copies of the associated files for use by the getSplits method.
  
  Throws:
  
  IOException
- setQueryInfo
  
  public void setQueryInfo(int newQueryBy, String newWhereClause, Integer newPartitionId)
- clearLocalKVSecurity
  
  public static void clearLocalKVSecurity()
  
  Convenience method for testing. A test calls this method when a test run prior to the current test has set the security related fields to values that may interfere with the current test.
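
The relationship between the setTimeout(long) and setTimeoutUnit(TimeUnit) methods and the oracle.kv.timeout property, whose value is always in milliseconds, can be sketched with the JDK's TimeUnit. The class TimeoutPropertySketch and its method are hypothetical, and the null handling reflects this sketch's reading of the setTimeoutUnit description (a null unit is tolerated only for a zero timeout), not confirmed behavior of the class.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: converts a timeout/unit pair to the millisecond property value. */
class TimeoutPropertySketch {

    /** Returns the string that could be supplied as the oracle.kv.timeout property. */
    static String toTimeoutProperty(long timeout, TimeUnit unit) {
        if (timeout == 0) {
            return "0";                          // zero selects the default request timeout
        }
        if (unit == null) {
            // Assumption: a unit is required whenever the timeout is non-zero.
            throw new IllegalArgumentException("a unit is required for a non-zero timeout");
        }
        return Long.toString(unit.toMillis(timeout));
    }

    public static void main(String[] args) {
        System.out.println(toTimeoutProperty(30, TimeUnit.SECONDS));   // prints 30000
        System.out.println(toTimeoutProperty(0, null));                // prints 0
    }
}
```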
