This section describes the Perfect Balance configuration properties and a few generic Hadoop MapReduce properties that Perfect Balance reads from the job configuration.
See "About Configuring Perfect Balance" for a list of the properties organized into functional categories.
Note:
CDH5 deprecates many MapReduce properties and replaces them with new properties. Perfect Balance continues to work with the old property names, but Oracle recommends that you use the new names. For the new MapReduce property names, see the Cloudera website.
MapReduce Configuration Properties
Property | Type, Default Value, Description |
---|---|
mapreduce.input.fileinputformat.inputdir |
Type: String Default Value: Not defined Description: A comma-separated list of input directories. |
mapreduce.inputformat.class |
Type: String Default Value: Description: The full name of the input format class. |
mapreduce.map.class |
Type: String Default Value: Description: The full name of the mapper class. |
mapreduce.output.fileoutputformat.outputdir |
Type: String Default Value: Not defined Description: The job output directory. |
mapreduce.partitioner.class |
Type: String Default Value: Description: The full name of the partitioner class. |
mapreduce.reduce.class |
Type: String Default Value: Description: The full name of the reducer class. |
Job Analyzer Configuration Properties
Property | Type, Default Value, Description |
---|---|
oracle.hadoop.balancer.application_id |
Type: String Default Value: Not defined Description: The job identifier of the job you want to analyze with Job Analyzer. This property is a parameter to the Job Analyzer utility in standalone mode on YARN clusters; it does not apply to MRv1 clusters. See "Running Job Analyzer as a Standalone Utility". |
oracle.hadoop.balancer.tools.writeKeyBytes |
Type: Boolean Default Value: Description: Controls whether the counting reducer collects the byte representations of the reduce keys for the Job Analyzer. Set this property to |
Perfect Balance Configuration Properties
Property | Type, Default Value, Description |
---|---|
oracle.hadoop.balancer.choppingStrategy |
Type: String Default Value: Description: Controls how the sampler behaves when it needs to chop a key. This property takes precedence over the deprecated property oracle.hadoop.balancer.enableSorting. The following values are valid: |
oracle.hadoop.balancer.confidence |
Type: Float Default Value: Description: The statistical confidence indicator for the load factor specified by the oracle.hadoop.balancer.maxLoadFactor property. This property accepts values greater than or equal to 0.5 and less than 1.0 (0.5 <= value < 1.0). A value less than 0.5 resets the property to its default value. Oracle recommends a value greater than or equal to 0.9. Typical values are 0.95 and 0.99. |
oracle.hadoop.balancer.enableSorting |
Type: Boolean Default Value: Description: This property is deprecated. To use the map output key sorting comparator as a total-order partitioning function, use oracle.hadoop.balancer.choppingStrategy instead. When this property is false, map output keys are chopped using a hash function. When this property is true, map output keys are chopped using the map output key sorting comparator as a total-order partitioning function, and the balancer preserves a total order over the values of a chopped key. See also: oracle.hadoop.balancer.choppingStrategy |
oracle.hadoop.balancer.inputFormat.mapred.map.tasks |
Type: Integer Default Value: Description: Sets the Hadoop Set this property to a value greater than or equal to one (1). A value less than 1 disables the property. Some input formats, such as You can increase the value for larger data sets, that is, more than a million rows of about 100 bytes per row. However, extremely large values can cause the input format's |
oracle.hadoop.balancer.inputFormat.mapred.max.split.size |
Type: Long Default Value: Description: Sets the Hadoop Set this property to a value greater than or equal to one (1). A value less than 1 disables the property. The optimal split size is a trade-off between obtaining a good sample (smaller splits) and efficient I/O performance (larger splits). Some input formats, such as You can increase the value for larger data sets (tens of terabytes) or if the input format's |
oracle.hadoop.balancer.keyLoad.minChopBytes |
Type: Long Default Value: Description: Controls whether Perfect Balance chops large map output keys into medium keys:
|
oracle.hadoop.balancer.linearKeyLoad.byteWeight |
Type: Float Default Value: Description: Weights the number of bytes per key in the linear key load model specified by the |
oracle.hadoop.balancer.linearKeyLoad.feedbackDir |
Type: String Default Value: Not defined Description: The path to a directory that contains the Job Analyzer report for a previously analyzed job. The sampler reads this report and uses it as feedback to optimize the current balancing plan. Set this property to the Job Analyzer report directory of a job that is the same as or similar to the current job, so that the feedback is directly applicable. If the feedback directory contains a Job Analyzer report with recommended values for the Perfect Balance linear key load model coefficients, then Perfect Balance automatically reads and uses them. The recommended values take precedence over user-specified values of the linear key load weight properties (oracle.hadoop.balancer.linearKeyLoad.byteWeight, oracle.hadoop.balancer.linearKeyLoad.keyWeight, and oracle.hadoop.balancer.linearKeyLoad.rowWeight). Job Analyzer attempts to recommend good values for these coefficients. However, Perfect Balance reads the load model coefficients from these configuration properties under the following circumstances:
|
oracle.hadoop.balancer.linearKeyLoad.keyWeight |
Type: Float Default Value: 50.0 Description: Weights the number of medium keys per large key in the linear key load model specified by the |
oracle.hadoop.balancer.linearKeyLoad.rowWeight |
Type: Float Default Value: 0.05 Description: Weights the number of rows per key in the linear key load model specified by the |
oracle.hadoop.balancer.maxLoadFactor |
Type: Float Default Value: 0.05 Description: The target reducer load factor that you want the balancer's partition plan to achieve. The load factor is the relative deviation from an estimated value. The values of this property and oracle.hadoop.balancer.confidence determine the sampler's stopping condition. The balancer samples until it can generate a plan that guarantees the specified load factor at the specified confidence level. This guarantee may not hold if the sampler stops early because of another stopping condition, such as the number of samples exceeding the limit set by oracle.hadoop.balancer.maxSamplesPct. |
oracle.hadoop.balancer.maxSamplesPct |
Type: Float Default Value: Description: Limits the number of samples that Perfect Balance can collect to a fraction of the total input records. A value less than zero disables the property (no limit). You may need to increase the value for Hadoop applications with very unbalanced reducer partitions or densely clustered map-output keys. The sampler needs to sample more data to achieve a good partitioning plan in these cases. |
oracle.hadoop.balancer.minSplits |
Type: Integer Default Value: Description: Sets the minimum number of splits that the sampler reads. If the total number of splits is less than this value, then the sampler reads all splits. Set this property to a value greater than or equal to one ( |
oracle.hadoop.balancer.numThreads |
Type: Integer Default Value: Description: Number of sampler threads. Set this value based on the processor and memory resources available on the node where the job is initiated. A higher number of sampler threads implies higher concurrency in sampling. Set this property to one ( |
oracle.hadoop.balancer.report.overwrite |
Type: Boolean Default Value: Description: Controls whether Perfect Balance overwrites files in the location specified by the |
oracle.hadoop.balancer.reportPath |
Type: String Default Value: Description: The path where Perfect Balance writes the partition report before the Hadoop job output directory is available, that is, before the MapReduce job finishes running. At the end of the job, Perfect Balance moves the file to |
oracle.hadoop.balancer.runMode |
Type: String Default Value: Description: Specifies how to run the Perfect Balance sampler. The following values are valid:
If this property is set to an invalid string, Perfect Balance resets it to |
oracle.hadoop.balancer.tmpDir |
Type: String Default Value: Description: The path to a staging directory in the file system of the job output directory (HDFS or local). Perfect Balance creates the directory if it does not exist, and copies the partition report to it for loading into the Hadoop distributed cache. |
oracle.hadoop.balancer.useClusterStats |
Type: Boolean Default Value: Description: Enables the sampler to use cluster sampling statistics. These statistics improve the accuracy of sampled estimates, such as the number of records in a map-output key, when the map-output keys are distributed in clusters across input splits, instead of being distributed independently across all input splits. Set this property to |
oracle.hadoop.balancer.useMapreduceApi |
Type: Boolean Default Value: Description: Identifies the MapReduce API used in the Hadoop job:
|
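
The range and limit rules stated in the table for oracle.hadoop.balancer.confidence, oracle.hadoop.balancer.maxSamplesPct, and oracle.hadoop.balancer.minSplits can be restated as small checks. The following Python is only an illustrative sketch of those rules as worded in this section; it is not Perfect Balance code. The function names are hypothetical, the default confidence is passed in as a parameter because its value is not shown here, and the handling of a confidence of 1.0 or more and the truncation to whole records are assumptions.

```python
from typing import Optional


def effective_confidence(value: float, default: float) -> float:
    """Apply the documented range rule for oracle.hadoop.balancer.confidence.

    Values in [0.5, 1.0) are accepted. A value below 0.5 resets the
    property to its default, which the caller supplies.
    """
    if value < 0.5:
        return default
    if value >= 1.0:
        # Assumption: the text does not say what happens at or above 1.0,
        # so this sketch rejects the value instead of guessing.
        raise ValueError("confidence must be less than 1.0")
    return value


def sample_limit(total_records: int, max_samples_pct: float) -> Optional[int]:
    """Sample cap implied by oracle.hadoop.balancer.maxSamplesPct.

    The property is a fraction of the total input records. A value less
    than zero disables the limit, represented here as None.
    """
    if max_samples_pct < 0:
        return None
    # Assumption: the result is truncated to a whole number of records.
    return int(total_records * max_samples_pct)


def min_splits_read(total_splits: int, min_splits: int) -> int:
    """Lower bound on the splits the sampler reads under minSplits.

    If the job has fewer splits than min_splits, the sampler reads all
    of them, so the bound is the smaller of the two numbers.
    """
    return min(total_splits, min_splits)
```

For instance, `effective_confidence(0.3, default)` returns whatever default the caller supplies, and `min_splits_read(3, 5)` returns 3 because a job with only three splits has all of them read.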