5.11 Perfect Balance Configuration Property Reference

This section describes the Perfect Balance configuration properties and a few generic Hadoop MapReduce properties that Perfect Balance reads from the job configuration.

See "About Configuring Perfect Balance" for a list of the properties organized into functional categories.

Note:

CDH5 deprecates many MapReduce properties and replaces them with new properties. Perfect Balance continues to work with the old property names, but Oracle recommends that you use the new names. For the new MapReduce property names, see the Cloudera website at:

http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

MapReduce Configuration Properties

Property	Type, Default Value, Description
mapreduce.input.fileinputformat.inputdir	Type: String Default Value: Not defined Description: A comma-separated list of input directories.
mapreduce.inputformat.class	Type: String Default Value: `org.apache.hadoop.mapreduce.lib.input.TextInputFormat` Description: The full name of the `InputFormat` class.
mapreduce.map.class	Type: String Default Value: `org.apache.hadoop.mapreduce.Mapper` Description: The full name of the mapper class.
mapreduce.output.fileoutputformat.outputdir	Type: String Default Value: Not defined Description: The job output directory.
mapreduce.partitioner.class	Type: String Default Value: `org.apache.hadoop.mapreduce.lib.partition.HashPartitioner` Description: The full name of the partitioner class.
mapreduce.reduce.class	Type: String Default Value: `org.apache.hadoop.mapreduce.Reducer` Description: The full name of the reducer class.
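
The properties in this table are ordinary job configuration entries. As a minimal sketch, assuming they are passed to the job as Hadoop generic `-D name=value` options, the following Python helper renders a property dictionary into that form. The helper name `to_generic_options` and the example paths are hypothetical and are not part of Perfect Balance.

```python
def to_generic_options(props):
    """Render a dict of job properties as -D name=value argument pairs."""
    args = []
    for name, value in props.items():
        args.extend(["-D", f"{name}={value}"])
    return args


# Hypothetical input directories; the property expects a comma-separated list.
input_dirs = ["/user/example/in1", "/user/example/in2"]

job_props = {
    "mapreduce.input.fileinputformat.inputdir": ",".join(input_dirs),
    "mapreduce.output.fileoutputformat.outputdir": "/user/example/out",
}

print(" ".join(to_generic_options(job_props)))
```

Joining the directories with commas mirrors the description of `mapreduce.input.fileinputformat.inputdir`; the remaining properties in the table can be added to the dictionary in the same way.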

Job Analyzer Configuration Properties

Property	Type, Default Value, Description
oracle.hadoop.balancer.application_id	Type: String Default Value: Not defined Description: The job identifier of the job you want to analyze with Job Analyzer. This property is a parameter to the Job Analyzer utility in standalone mode on YARN clusters; it does not apply to MRv1 clusters. See "Running Job Analyzer as a Standalone Utility".
oracle.hadoop.balancer.tools.writeKeyBytes	Type: Boolean Default Value: `false` Description: Controls whether the counting reducer collects the byte representations of the reduce keys for the Job Analyzer. Set this property to `true` to represent the unique key values in Base64 encoding in the report. A string representation of the key, created using `key.toString`, is also provided in the report. This string value may not be unique for each key.
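
The description of `oracle.hadoop.balancer.tools.writeKeyBytes` notes that the string form of a key may not be unique, while the Base64 form of its bytes identifies the key exactly. The following Python sketch illustrates that difference only; it does not reproduce the Job Analyzer report format. The function `report_key` is hypothetical, and the lossy UTF-8 decoding is an assumed stand-in for a `key.toString` implementation that loses information.

```python
import base64


def report_key(key_bytes):
    """Return (string form, Base64 form) for one reduce key."""
    # Lossy stand-in for key.toString: invalid bytes become U+FFFD.
    as_text = key_bytes.decode("utf-8", errors="replace")
    # Base64 of the raw bytes is reversible, so it is unique per key.
    as_base64 = base64.b64encode(key_bytes).decode("ascii")
    return as_text, as_base64


# Two distinct keys that differ only in a byte that is not valid UTF-8.
text1, b64_1 = report_key(b"id-\xff")
text2, b64_2 = report_key(b"id-\xfe")

print(text1 == text2)   # the string forms collide
print(b64_1 == b64_2)   # the Base64 forms do not
```

Because Base64 is reversible, decoding the reported value recovers the original key bytes, which is not possible from a colliding string form.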

Perfect Balance Configuration Properties

Property	Type, Default Value, Description
oracle.hadoop.balancer.choppingStrategy	Type: String Default Value: `hash` Description: Controls how the sampler behaves when it needs to chop a key. The following values are valid: `range`: Records of chopped keys are assigned to different reducers according to the total-order partitioning function specified by the map output key sorting comparator, so the balancer preserves a total order over the values of a chopped key. `hash`: Records of chopped keys are assigned to different reducers according to the hashCode on the map output values. In most cases, this approach gives a balanced workload among reducers. `roundRobin`: Records of chopped keys are assigned to different reducers in round-robin order. Use this strategy when you do not need to preserve a total order over the values of a chopped key. If the load for a hash-chopped key is unbalanced among reducers, try this chopping strategy. This property takes precedence over the deprecated property `oracle.hadoop.balancer.enableSorting`. If this property is not set, then `oracle.hadoop.balancer.enableSorting=true` is equivalent to setting it to `range`, and `oracle.hadoop.balancer.enableSorting=false` is equivalent to setting it to `hash`. See also the deprecated property `oracle.hadoop.balancer.enableSorting`.
oracle.hadoop.balancer.confidence	Type: Float Default Value: `0.95` Description: The statistical confidence indicator for the load factor specified by the `oracle.hadoop.balancer.maxLoadFactor` property. This property accepts values greater than or equal to 0.5 and less than 1.0 (0.5 <= value < 1.0). A value less than 0.5 resets the property to its default value. Oracle recommends a value greater than or equal to 0.9. Typical values are 0.95 and 0.99.
oracle.hadoop.balancer.enableSorting	Type: Boolean Default Value: `false` Description: This property is deprecated. To use the map output key sorting comparator as a total-order partitioning function, set `oracle.hadoop.balancer.choppingStrategy` to `range`. When this property is `false`, map output keys are chopped using a hash function. When this property is `true`, map output keys are chopped using the map output key sorting comparator as a total-order partitioning function, and the balancer preserves a total order over the values of a chopped key. See also: `oracle.hadoop.balancer.choppingStrategy`
oracle.hadoop.balancer.inputFormat.mapred.map.tasks	Type: Integer Default Value: `100` Description: Sets the Hadoop `mapred.map.tasks` property for the duration of sampling, just before calling the input format's `getSplits` method. It does not change `mapred.map.tasks` for the actual job. The optimal number of map tasks is a trade-off between obtaining a good sample (larger number) and having finite memory resources (smaller number). Set this property to a value greater than or equal to one (`1`). A value less than 1 disables the property. Some input formats, such as `DBInputFormat`, use this property as a hint to determine the number of splits returned by `getSplits`. Higher values indicate that more chunks of data are sampled at random, which improves the sample. You can increase the value for larger data sets, that is, more than a million rows of about 100 bytes per row. However, extremely large values can cause the input format's `getSplits` method to run out of memory by returning too many splits.
oracle.hadoop.balancer.inputFormat.mapred.max.split.size	Type: Long Default Value: `1048576` (1 MB) Description: Sets the Hadoop `mapred.max.split.size` property for the duration of sampling, just before calling the input format's `getSplits` method. It does not change `mapred.max.split.size` for the actual job. Set this property to a value greater than or equal to one (1). A value less than 1 disables the property. The optimal split size is a trade-off between obtaining a good sample (smaller splits) and efficient I/O performance (larger splits). Some input formats, such as `FileInputFormat`, use the maximum split size as a hint to determine the number of splits returned by `getSplits`. Smaller split sizes indicate that more chunks of data are sampled at random, which improves the sample. Set the value small enough for good sampling performance, but no smaller. Extremely small values can cause inefficient I/O performance, while not improving the sample. You can increase the value for larger data sets (tens of terabytes) or if the input format's `getSplits` method throws an out of memory error. Large splits are better for I/O performance, but not for sampling.
oracle.hadoop.balancer.keyLoad.minChopBytes	Type: Long Default Value: `0` Description: Controls whether Perfect Balance chops large map output keys into medium keys: `-1`: Perfect Balance does not chop large map output keys. `0`: Perfect Balance chops large map output keys and determines the optimal size of each medium key. Positive integer: Perfect Balance chops large map output keys into medium keys with a size greater than or equal to the specified integer.
oracle.hadoop.balancer.linearKeyLoad.byteWeight	Type: Float Default Value: `0.05` Description: Weights the number of bytes per key in the linear key load model specified by the `oracle.hadoop.balancer.KeyLoadLinear` class.
oracle.hadoop.balancer.linearKeyLoad.feedbackDir	Type: String Default Value: Not defined Description: The path to a directory that contains the Job Analyzer report for a job that it previously analyzed. The sampler reads this report for feedback to use to optimize the current balancing plan. You can set this property to the Job Analyzer report directory of a job that is the same or similar to the current job, so that the feedback is directly applicable. If the feedback directory contains a Job Analyzer report with recommended values for the Perfect Balance linear key load model coefficients, then Perfect Balance automatically reads and uses them. The recommended values take precedence over user-specified values in these configuration parameters: `oracle.hadoop.balancer.linearKeyLoad.byteWeight` `oracle.hadoop.balancer.linearKeyLoad.keyWeight` `oracle.hadoop.balancer.linearKeyLoad.rowWeight` Job Analyzer attempts to recommend good values for these coefficients. However, Perfect Balance reads the load model coefficients from this list of configuration properties under the following circumstances: The `feedbackDir` property is not set. The `feedbackDir` property is set, but the Job Analyzer report in the specified directory does not contain a good recommendation for the load model coefficients.
oracle.hadoop.balancer.linearKeyLoad.keyWeight	Type: Float Default Value: `50.0` Description: Weights the number of medium keys per large key in the linear key load model specified by the `oracle.hadoop.balancer.KeyLoadLinear` class.
oracle.hadoop.balancer.linearKeyLoad.rowWeight	Type: Float Default Value: `0.05` Description: Weights the number of rows per key in the linear key load model specified by the `oracle.hadoop.balancer.KeyLoadLinear` class.
oracle.hadoop.balancer.maxLoadFactor	Type: Float Default Value: `0.05` Description: The target reducer load factor that you want the balancer's partition plan to achieve. The load factor is the relative deviation from an estimated value. For example, if `maxLoadFactor=0.05` and `confidence=0.95`, then with a confidence greater than 95%, the job's reducer loads should be, at most, 5% greater than the value in the partition plan. The values of these two properties determine the sampler's stopping condition. The balancer samples until it can generate a plan that guarantees the specified load factor at the specified confidence level. This guarantee may not hold if the sampler stops early because of another stopping condition, such as when the number of samples exceeds the limit set by `oracle.hadoop.balancer.maxSamplesPct`. The partition report logs the stopping condition. See `oracle.hadoop.balancer.confidence`.
oracle.hadoop.balancer.maxSamplesPct	Type: Float Default Value: `0.01` (1%) Description: Limits the number of samples that Perfect Balance can collect to a fraction of the total input records. A value less than zero disables the property (no limit). You may need to increase the value for Hadoop applications with very unbalanced reducer partitions or densely clustered map-output keys. The sampler needs to sample more data to achieve a good partitioning plan in these cases. See `oracle.hadoop.balancer.useClusterStats`.
oracle.hadoop.balancer.minSplits	Type: Integer Default Value: `5` Description: Sets the minimum number of splits that the sampler reads. If the total number of splits is less than this value, then the sampler reads all splits. Set this property to a value greater than or equal to one (`1`). A nonpositive number sets the property to `1`.
oracle.hadoop.balancer.numThreads	Type: Integer Default Value: `5` Description: Number of sampler threads. Set this value based on the processor and memory resources available on the node where the job is initiated. A higher number of sampler threads implies higher concurrency in sampling. Set this property to one (`1`) to disable multithreading in the sampler.
oracle.hadoop.balancer.report.overwrite	Type: Boolean Default Value: `false` Description: Controls whether Perfect Balance overwrites files in the location specified by the `oracle.hadoop.balancer.reportPath` property. By default, Perfect Balance does not overwrite files; it throws an exception. Set this property to `true` to allow partition reports to be overwritten.
oracle.hadoop.balancer.reportPath	Type: String Default Value: `directory/orabalancer_report-random_unique_string.json`, where directory for HDFS is the home directory of the user who submits the job. For the local file system, it is the directory where the job is submitted. Description: The path where Perfect Balance writes the partition report before the Hadoop job output directory is available, that is, before the MapReduce job finishes running. At the end of the job, Perfect Balance moves the file to `job_output_dir/_balancer/orabalancer_report.json`. In the API, the `save` method does this task.
oracle.hadoop.balancer.runMode	Type: String Default Value: `local` Description: Specifies how to run the Perfect Balance sampler. The following values are valid: `local`: The sampler runs on the client node where the job is submitted. `distributed`: The sampler runs as a Hadoop job. If the job uses the distributed cache, then Perfect Balance automatically sets this property to `distributed`. If this property is set to an invalid string, Perfect Balance resets it to `local`.
oracle.hadoop.balancer.tmpDir	Type: String Default Value: `/tmp/orabalancer-user_name` Description: The path to a staging directory in the file system of the job output directory (HDFS or local). Perfect Balance creates the directory if it does not exist, and copies the partition report to it for loading into the Hadoop distributed cache.
oracle.hadoop.balancer.useClusterStats	Type: Boolean Default Value: `true` Description: Enables the sampler to use cluster sampling statistics. These statistics improve the accuracy of sampled estimates, such as the number of records in a map-output key, when the map-output keys are distributed in clusters across input splits, instead of being distributed independently across all input splits. Set this property to `false` only if you are absolutely certain that the map-output keys are not clustered. This setting improves the sampler's estimates only when there is, in fact, no clustering. Oracle recommends leaving this property set to `true`, because the distribution of map-output keys is usually unknown.
oracle.hadoop.balancer.useMapreduceApi	Type: Boolean Default Value: `true` Description: Identifies the MapReduce API used in the Hadoop job: `true`: The job uses the `mapreduce` API. `false`: The job uses the `mapred` API.
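
Several descriptions in this table state what happens when a property is missing or out of range. The following Python sketch restates those documented rules for `confidence`, `minSplits`, `runMode`, and `choppingStrategy` in one place. It is an illustration, not Perfect Balance code: the function `resolve`, the `uses_distributed_cache` flag, and the plain dictionary standing in for the job configuration are all hypothetical, and behavior that the table does not specify (for example, a `confidence` of 1.0 or more) is deliberately left alone.

```python
PREFIX = "oracle.hadoop.balancer."
VALID_RUN_MODES = {"local", "distributed"}


def resolve(conf, uses_distributed_cache=False):
    """Apply the documented defaults and fallbacks to a dict of string settings."""
    out = {}

    # confidence: a value less than 0.5 resets the property to its default.
    confidence = float(conf.get(PREFIX + "confidence", "0.95"))
    if confidence < 0.5:
        confidence = 0.95
    out["confidence"] = confidence  # values >= 1.0 are not covered by the table

    # minSplits: a nonpositive number sets the property to 1.
    min_splits = int(conf.get(PREFIX + "minSplits", "5"))
    out["minSplits"] = min_splits if min_splits >= 1 else 1

    # runMode: forced to distributed when the job uses the distributed cache;
    # an invalid string is reset to local.
    run_mode = conf.get(PREFIX + "runMode", "local")
    if uses_distributed_cache:
        run_mode = "distributed"
    elif run_mode not in VALID_RUN_MODES:
        run_mode = "local"
    out["runMode"] = run_mode

    # choppingStrategy takes precedence over the deprecated enableSorting.
    strategy = conf.get(PREFIX + "choppingStrategy")
    if strategy is None:
        sorting = conf.get(PREFIX + "enableSorting", "false").lower() == "true"
        strategy = "range" if sorting else "hash"
    out["choppingStrategy"] = strategy

    return out


print(resolve({PREFIX + "confidence": "0.3", PREFIX + "enableSorting": "true"}))
```

With these inputs, the sketch falls back to the default confidence of 0.95 and maps the deprecated `enableSorting=true` setting to the `range` chopping strategy, as the property descriptions above specify.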