5.11 Perfect Balance Configuration Property Reference

This section describes the Perfect Balance configuration properties and a few generic Hadoop MapReduce properties that Perfect Balance reads from the job configuration.

See "About Configuring Perfect Balance" for a list of the properties organized into functional categories.

Note:

CDH5 deprecates many MapReduce properties and replaces them with new properties. Perfect Balance continues to work with the old property names, but Oracle recommends that you use the new names. For the new MapReduce property names, see the Cloudera website at:

http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

MapReduce Configuration Properties

Each entry gives the property name, followed by its type, default value, and description.

mapreduce.input.fileinputformat.inputdir

Type: String

Default Value: Not defined

Description: A comma-separated list of input directories.

mapreduce.inputformat.class

Type: String

Default Value: org.apache.hadoop.mapreduce.lib.input.TextInputFormat

Description: The full name of the InputFormat class.

mapreduce.map.class

Type: String

Default Value: org.apache.hadoop.mapreduce.Mapper

Description: The full name of the mapper class.

mapreduce.output.fileoutputformat.outputdir

Type: String

Default Value: Not defined

Description: The job output directory.

mapreduce.partitioner.class

Type: String

Default Value: org.apache.hadoop.mapreduce.lib.partition.HashPartitioner

Description: The full name of the partitioner class.

mapreduce.reduce.class

Type: String

Default Value: org.apache.hadoop.mapreduce.Reducer

Description: The full name of the reducer class.

Job Analyzer Configuration Properties

Each entry gives the property name, followed by its type, default value, and description.

oracle.hadoop.balancer.application_id

Type: String

Default Value: Not defined

Description: The job identifier of the job you want to analyze with Job Analyzer. This property is a parameter to the Job Analyzer utility in standalone mode on YARN clusters; it does not apply to MRv1 clusters. See "Running Job Analyzer as a Standalone Utility".

oracle.hadoop.balancer.tools.writeKeyBytes

Type: Boolean

Default Value: false

Description: Controls whether the counting reducer collects the byte representations of the reduce keys for the Job Analyzer. Set this property to true to represent the unique key values in Base64 encoding in the report. A string representation of the key, created using key.toString, is also provided in the report. This string value may not be unique for each key.
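The difference between the two key representations can be illustrated with a short sketch. It is not Perfect Balance code: describe_key is a hypothetical helper, and decoding with replacement characters is only a stand-in for key.toString, whose real behavior depends on the key class.

```python
import base64


def describe_key(key_bytes: bytes) -> dict:
    """Return a Base64 form and a string form of a reduce key (illustrative only)."""
    return {
        # Base64 encodes the raw bytes, so distinct keys give distinct values.
        "base64": base64.b64encode(key_bytes).decode("ascii"),
        # Stand-in for key.toString (assumption): undecodable bytes become U+FFFD.
        "string": key_bytes.decode("utf-8", errors="replace"),
    }


a = describe_key(b"\xff")
b = describe_key(b"\xfe")
print(a["string"] == b["string"])  # the string forms collide
print(a["base64"], b["base64"])    # the Base64 forms stay distinct
```

In this sketch two different keys share one string form, which is why the Base64 column is the safer way to tell keys apart in the report.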

Perfect Balance Configuration Properties

Each entry gives the property name, followed by its type, default value, and description.

oracle.hadoop.balancer.choppingStrategy

Note that the choppingStrategy property takes precedence over the deprecated property oracle.hadoop.balancer.enableSorting. If the choppingStrategy property is not set, oracle.hadoop.balancer.enableSorting=true is equivalent to setting the choppingStrategy property to range. Likewise, setting oracle.hadoop.balancer.enableSorting=false is equivalent to setting the choppingStrategy property to hash.

Type: String

Default Value: hash

Description: This property controls the behavior of the sampler when it needs to chop a key. The following values are valid:

  • range: Records of chopped keys are assigned to different reducers according to the total-order partitioning function specified by the map output key sorting comparator, so the balancer preserves a total order over the values of a chopped key.

  • hash: Records of chopped keys are assigned to different reducers according to the hashCode of the map output values. In most cases, this approach gives a balanced workload among reducers.

  • roundRobin: Records of chopped keys are assigned to different reducers in round-robin order. This is an alternative strategy when you do not need to preserve a total order over the values of a chopped key. If the load for a hash-chopped key is unbalanced among reducers, try this chopping strategy.

See also the deprecated property: oracle.hadoop.balancer.enableSorting
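The precedence between choppingStrategy and the deprecated enableSorting property can be sketched as follows. This is a minimal illustration, not Perfect Balance code: resolve_chopping_strategy is a hypothetical helper, and the configuration is modeled as a plain dictionary of strings.

```python
def resolve_chopping_strategy(conf: dict) -> str:
    """Apply the documented precedence: choppingStrategy wins when it is set."""
    strategy = conf.get("oracle.hadoop.balancer.choppingStrategy")
    if strategy is not None:
        return strategy
    # Fall back to the deprecated property: true maps to range, false to hash.
    enable_sorting = conf.get("oracle.hadoop.balancer.enableSorting", "false")
    return "range" if str(enable_sorting).lower() == "true" else "hash"


print(resolve_chopping_strategy({"oracle.hadoop.balancer.enableSorting": "true"}))
```

When neither property is set, the sketch returns hash, which matches the documented default.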

oracle.hadoop.balancer.confidence

Type: Float

Default Value: 0.95

Description: The statistical confidence indicator for the load factor specified by the oracle.hadoop.balancer.maxLoadFactor property.

This property accepts values greater than or equal to 0.5 and less than 1.0 (0.5 <= value < 1.0). A value less than 0.5 resets the property to its default value. Oracle recommends a value greater than or equal to 0.9. Typical values are 0.95 and 0.99.
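The range rule can be expressed as a small check. The following sketch is illustrative only: effective_confidence is a hypothetical helper, and rejecting values of 1.0 or more is an assumption, because this reference does not say how such values are handled.

```python
DEFAULT_CONFIDENCE = 0.95


def effective_confidence(value: float) -> float:
    """Return the confidence that would be used for a configured value."""
    if value < 0.5:
        # Documented rule: a value below 0.5 resets the property to its default.
        return DEFAULT_CONFIDENCE
    if value >= 1.0:
        # Assumption: behavior is not documented here, so the sketch rejects it.
        raise ValueError("confidence must be less than 1.0")
    return value


print(effective_confidence(0.3))
print(effective_confidence(0.99))
```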

oracle.hadoop.balancer.enableSorting

Type: Boolean

Default Value: false

Description: This property is deprecated. To use the map output key sorting comparator as a total-order partitioning function, set oracle.hadoop.balancer.choppingStrategy to range.

When this property is false, map output keys are chopped using a hash function. When this property is true, map output keys are chopped using the map output key sorting comparator as a total-order partitioning function, and the balancer preserves a total order over the values of a chopped key.

See also: oracle.hadoop.balancer.choppingStrategy

oracle.hadoop.balancer.inputFormat.mapred.map.tasks

Type: Integer

Default Value: 100

Description: Sets the Hadoop mapred.map.tasks property for the duration of sampling, just before calling the input format's getSplits method. It does not change mapred.map.tasks for the actual job. The optimal number of map tasks is a trade-off between obtaining a good sample (larger number) and having finite memory resources (smaller number).

Set this property to a value greater than or equal to one (1). A value less than 1 disables the property.

Some input formats, such as DBInputFormat, use this property as a hint to determine the number of splits returned by getSplits. Higher values indicate that more chunks of data are sampled at random, which improves the sample.

You can increase the value for larger data sets, that is, more than a million rows of about 100 bytes per row. However, extremely large values can cause the input format's getSplits method to run out of memory by returning too many splits.

oracle.hadoop.balancer.inputFormat.mapred.max.split.size

Type: Long

Default Value: 1048576 (1 MB)

Description: Sets the Hadoop mapred.max.split.size property for the duration of sampling, just before calling the input format's getSplits method. It does not change mapred.max.split.size for the actual job.

Set this property to a value greater than or equal to one (1). A value less than 1 disables the property. The optimal split size is a trade-off between obtaining a good sample (smaller splits) and efficient I/O performance (larger splits).

Some input formats, such as FileInputFormat, use the maximum split size as a hint to determine the number of splits returned by getSplits. Smaller split sizes indicate that more chunks of data are sampled at random, which improves the sample. Set the value small enough for good sampling performance, but no smaller. Extremely small values can cause inefficient I/O performance, while not improving the sample.

You can increase the value for larger data sets (tens of terabytes) or if the input format's getSplits method throws an out of memory error. Large splits are better for I/O performance, but not for sampling.
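To get a feel for the trade-off, the number of splits can be estimated from the input size. The sketch below is a rough approximation under a stated assumption: every split is exactly the maximum split size. Real input formats also consider block size and minimum split size, so treat the result as an order-of-magnitude figure. The approx_sampling_splits helper is hypothetical.

```python
def approx_sampling_splits(input_size_bytes: int, max_split_size: int = 1048576) -> int:
    """Estimate the split count, assuming each split is exactly max_split_size bytes."""
    if max_split_size < 1:
        # A value less than 1 disables the property, so no estimate applies.
        raise ValueError("max split size is disabled")
    # Integer ceiling division avoids floating-point rounding on large inputs.
    return -(-input_size_bytes // max_split_size)


ten_gib = 10 * 1024 ** 3
print(approx_sampling_splits(ten_gib))                     # default 1 MB splits
print(approx_sampling_splits(ten_gib, 64 * 1024 * 1024))   # larger 64 MB splits
```

Under this assumption, raising the split size from 1 MB to 64 MB cuts the number of split objects that getSplits must return by a factor of 64, which is why a larger value can relieve memory pressure.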

oracle.hadoop.balancer.keyLoad.minChopBytes

Type: Long

Default Value: 0

Description: Controls whether Perfect Balance chops large map output keys into medium keys:

  • -1: Perfect Balance does not chop large map output keys.

  • 0: Perfect Balance chops large map output keys and determines the optimal size of each medium key.

  • Positive integer: Perfect Balance chops large map output keys into medium keys with a size greater than or equal to the specified integer.
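The three cases above can be summarized in a small lookup. This is an illustrative sketch with a hypothetical helper name; negative values other than -1 are not described in this reference, so the sketch rejects them as an assumption.

```python
def describe_min_chop_bytes(value: int) -> str:
    """Map a minChopBytes setting to the documented behavior."""
    if value == -1:
        return "no chopping"
    if value == 0:
        return "chop, medium key size chosen by Perfect Balance"
    if value > 0:
        return f"chop, medium key size >= {value}"
    # Assumption: other negative values are undocumented, so reject them here.
    raise ValueError("undocumented minChopBytes value")


for setting in (-1, 0, 4096):
    print(setting, "->", describe_min_chop_bytes(setting))
```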

oracle.hadoop.balancer.linearKeyLoad.byteWeight

Type: Float

Default Value: 0.05

Description: Weights the number of bytes per key in the linear key load model specified by the oracle.hadoop.balancer.KeyLoadLinear class.

oracle.hadoop.balancer.linearKeyLoad.feedbackDir

Type: String

Default Value: Not defined

Description: The path to a directory that contains the Job Analyzer report for a job that it previously analyzed. The sampler reads this report for feedback to use to optimize the current balancing plan. You can set this property to the Job Analyzer report directory of a job that is the same or similar to the current job, so that the feedback is directly applicable.

If the feedback directory contains a Job Analyzer report with recommended values for the Perfect Balance linear key load model coefficients, then Perfect Balance automatically reads and uses them. The recommended values take precedence over user-specified values in these configuration parameters:

  • oracle.hadoop.balancer.linearKeyLoad.byteWeight

  • oracle.hadoop.balancer.linearKeyLoad.keyWeight

  • oracle.hadoop.balancer.linearKeyLoad.rowWeight

Job Analyzer attempts to recommend good values for these coefficients. However, Perfect Balance reads the load model coefficients from this list of configuration properties under the following circumstances:

  • The feedbackDir property is not set.

  • The feedbackDir property is set, but the Job Analyzer report in the specified directory does not contain a good recommendation for the load model coefficients.
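The selection logic described above can be sketched as follows. It is an illustration, not Perfect Balance code: choose_coefficients is a hypothetical helper, the configuration is a plain dictionary, and report_recommendation stands in for whatever a parsed Job Analyzer report would provide (None when the report has no good recommendation).

```python
PREFIX = "oracle.hadoop.balancer.linearKeyLoad."
DEFAULTS = {"byteWeight": 0.05, "keyWeight": 50.0, "rowWeight": 0.05}


def choose_coefficients(conf: dict, report_recommendation=None) -> dict:
    """Pick load model coefficients from the report if usable, else from the config."""
    feedback_dir = conf.get(PREFIX + "feedbackDir")
    if feedback_dir and report_recommendation:
        # Recommended values take precedence over user-specified values.
        return dict(report_recommendation)
    # Otherwise read the configured values, falling back to documented defaults.
    return {name: float(conf.get(PREFIX + name, default))
            for name, default in DEFAULTS.items()}


conf = {PREFIX + "feedbackDir": "/user/alice/prev_job/_balancer", PREFIX + "keyWeight": "75"}
print(choose_coefficients(conf))  # no usable recommendation: configured values
print(choose_coefficients(conf, {"byteWeight": 0.1, "keyWeight": 40.0, "rowWeight": 0.2}))
```

The directory path in the usage lines is a made-up example.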

oracle.hadoop.balancer.linearKeyLoad.keyWeight

Type: Float

Default Value: 50.0

Description: Weights the number of medium keys per large key in the linear key load model specified by the oracle.hadoop.balancer.KeyLoadLinear class.

oracle.hadoop.balancer.linearKeyLoad.rowWeight

Type: Float

Default Value: 0.05

Description: Weights the number of rows per key in the linear key load model specified by the oracle.hadoop.balancer.KeyLoadLinear class.
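As a rough illustration of how the three weights might combine, the sketch below treats the load of a key as a weighted sum of its medium-key count, row count, and byte count. This form is an assumption inferred from the phrase "linear key load model"; the exact formula used by the oracle.hadoop.balancer.KeyLoadLinear class is not given in this reference, and linear_key_load is a hypothetical helper.

```python
def linear_key_load(medium_keys: int, rows: int, num_bytes: int,
                    key_weight: float = 50.0,
                    row_weight: float = 0.05,
                    byte_weight: float = 0.05) -> float:
    """Assumed linear model: a weighted sum of keys, rows, and bytes."""
    return key_weight * medium_keys + row_weight * rows + byte_weight * num_bytes


# One medium key holding 1,000 rows and 100,000 bytes, with the default weights.
print(linear_key_load(medium_keys=1, rows=1000, num_bytes=100000))
```

Under this assumed form, the default weights make each additional key cost as much as 1,000 rows or 1,000 bytes, so per-key overhead dominates for small keys.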

oracle.hadoop.balancer.maxLoadFactor

Type: Float

Default Value: 0.05

Description: The target reducer load factor that you want the balancer's partition plan to achieve.

The load factor is the relative deviation from an estimated value. For example, if maxLoadFactor=0.05 and confidence=0.95, then with a confidence greater than 95%, the job's reducer loads should be, at most, 5% greater than the value in the partition plan.

The values of these two properties determine the sampler's stopping condition. The balancer samples until it can generate a plan that guarantees the specified load factor at the specified confidence level. This guarantee may not hold if the sampler stops early because of another stopping condition, such as when the number of samples exceeds oracle.hadoop.balancer.maxSamplesPct. The partition report logs the stopping condition.

See oracle.hadoop.balancer.confidence.
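The load factor arithmetic can be worked through with a short example. The helper name max_expected_load and the sample planned load of 1000 are illustrative assumptions; the unit of load is whatever the partition plan reports.

```python
def max_expected_load(planned_load: float, max_load_factor: float = 0.05) -> float:
    """Upper bound on a reducer's load, as a relative deviation from the plan."""
    return planned_load * (1.0 + max_load_factor)


# With maxLoadFactor=0.05, a reducer planned at 1000 units should stay at or
# below this bound, at the configured confidence level.
print(max_expected_load(1000.0))
```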

oracle.hadoop.balancer.maxSamplesPct

Type: Float

Default Value: 0.01 (1%)

Description: Limits the number of samples that Perfect Balance can collect to a fraction of the total input records. A value less than zero disables the property (no limit).

You may need to increase the value for Hadoop applications with very unbalanced reducer partitions or densely clustered map-output keys. The sampler needs to sample more data to achieve a good partitioning plan in these cases.

See oracle.hadoop.balancer.useClusterStats.
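The sample limit can be estimated from the total record count, as in the following sketch. The helper sample_limit is hypothetical, and truncating the product to an integer is an assumption about rounding.

```python
def sample_limit(total_records: int, max_samples_pct: float = 0.01):
    """Return the maximum number of samples, or None when the limit is disabled."""
    if max_samples_pct < 0:
        # A value less than zero disables the property (no limit).
        return None
    # Assumption: the fractional result is truncated to a whole record count.
    return int(total_records * max_samples_pct)


print(sample_limit(1_000_000))        # default 1% of one million records
print(sample_limit(1_000_000, -1.0))  # limit disabled
```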

oracle.hadoop.balancer.minSplits

Type: Integer

Default Value: 5

Description: Sets the minimum number of splits that the sampler reads. If the total number of splits is less than this value, then the sampler reads all splits. Set this property to a value greater than or equal to one (1). A nonpositive number sets the property to 1.
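The rules for this property can be sketched as a small function. The helper name min_splits_read is hypothetical; the logic follows only what is stated above.

```python
def min_splits_read(min_splits: int, total_splits: int) -> int:
    """Lower bound on the number of splits the sampler reads."""
    # A nonpositive setting is treated as 1.
    effective = max(1, min_splits)
    # With fewer splits than the minimum, the sampler reads all of them.
    return min(effective, total_splits)


print(min_splits_read(5, 3))    # fewer splits than the minimum: read all 3
print(min_splits_read(0, 10))   # nonpositive setting behaves like 1
```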

oracle.hadoop.balancer.numThreads

Type: Integer

Default Value: 5

Description: The number of sampler threads. Set this value based on the processor and memory resources available on the node where the job is initiated. A higher number of sampler threads implies higher concurrency in sampling. Set this property to one (1) to disable multithreading in the sampler.

oracle.hadoop.balancer.report.overwrite

Type: Boolean

Default Value: false

Description: Controls whether Perfect Balance overwrites files in the location specified by the oracle.hadoop.balancer.reportPath property. By default, Perfect Balance does not overwrite files; it throws an exception. Set this property to true to allow partition reports to be overwritten.

oracle.hadoop.balancer.reportPath

Type: String

Default Value: directory/orabalancer_report-random_unique_string.json, where directory is the home directory of the user who submits the job (for HDFS) or the directory where the job is submitted (for the local file system).

Description: The path where Perfect Balance writes the partition report before the Hadoop job output directory is available, that is, before the MapReduce job finishes running. At the end of the job, Perfect Balance moves the file to job_output_dir/_balancer/orabalancer_report.json. In the API, the save method does this task.
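The final location of the report can be derived from the job output directory, as in this sketch. The helper final_report_path and the sample output directory are illustrative assumptions; posixpath is used because HDFS paths use forward slashes.

```python
import posixpath


def final_report_path(job_output_dir: str) -> str:
    """Build the path where the partition report is found after the job ends."""
    return posixpath.join(job_output_dir, "_balancer", "orabalancer_report.json")


print(final_report_path("/user/alice/wordcount_out"))
```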

oracle.hadoop.balancer.runMode

Type: String

Default Value: local

Description: Specifies how to run the Perfect Balance sampler. The following values are valid:

  • local: The sampler runs on the client node where the job is submitted.

  • distributed: The sampler runs as a Hadoop job. If the job uses the distributed cache, then Perfect Balance automatically sets this property to distributed.

If this property is set to an invalid string, Perfect Balance resets it to local.
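The run mode rules can be sketched as follows. The helper effective_run_mode is hypothetical, and checking the distributed cache before validating the string is an assumption about the order in which the two rules apply.

```python
VALID_RUN_MODES = ("local", "distributed")


def effective_run_mode(configured: str, uses_distributed_cache: bool = False) -> str:
    """Return the run mode the sampler would use for a configured value."""
    if uses_distributed_cache:
        # Jobs that use the distributed cache are switched to distributed mode.
        return "distributed"
    if configured not in VALID_RUN_MODES:
        # An invalid string is reset to local.
        return "local"
    return configured


print(effective_run_mode("cluster"))
print(effective_run_mode("local", uses_distributed_cache=True))
```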

oracle.hadoop.balancer.tmpDir

Type: String

Default Value: /tmp/orabalancer-user_name

Description: The path to a staging directory in the file system of the job output directory (HDFS or local). Perfect Balance creates the directory if it does not exist, and copies the partition report to it for loading into the Hadoop distributed cache.

oracle.hadoop.balancer.useClusterStats

Type: Boolean

Default Value: true

Description: Enables the sampler to use cluster sampling statistics. These statistics improve the accuracy of sampled estimates, such as the number of records in a map-output key, when the map-output keys are distributed in clusters across input splits, instead of being distributed independently across all input splits.

Set this property to false only if you are absolutely certain that the map-output keys are not clustered. This setting improves the sampler's estimates only when there is, in fact, no clustering. Oracle recommends leaving this property set to true, because the distribution of map-output keys is usually unknown.

oracle.hadoop.balancer.useMapreduceApi

Type: Boolean

Default Value: true

Description: Identifies the MapReduce API used in the Hadoop job:

  • true: The job uses the mapreduce API.

  • false: The job uses the mapred API.
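One way to collect a set of these properties is a Hadoop configuration XML file, which uses a configuration root element with property, name, and value children. The sketch below generates such a file body from a dictionary. The helper to_hadoop_conf_xml and the chosen property values are illustrative assumptions, not recommended settings.

```python
import xml.etree.ElementTree as ET


def to_hadoop_conf_xml(props: dict) -> str:
    """Serialize name/value pairs into Hadoop configuration XML."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example values only; adjust them for the job being balanced.
example = {
    "oracle.hadoop.balancer.choppingStrategy": "range",
    "oracle.hadoop.balancer.confidence": "0.99",
    "oracle.hadoop.balancer.maxLoadFactor": "0.05",
    "oracle.hadoop.balancer.useMapreduceApi": "true",
}
print(to_hadoop_conf_xml(example))
```

Values are passed as strings so that Boolean settings are written in lowercase, exactly as they appear in this reference.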