4 Optimizing MapReduce Jobs Using Perfect Balance

This chapter describes how you can shorten the run time of some MapReduce jobs by using Perfect Balance. It contains the following sections:

4.1 What is Perfect Balance?

The Perfect Balance feature of Oracle Big Data Appliance distributes the reducer load in a MapReduce application so that each reduce task does approximately the same amount of work. While the default Hadoop method of distributing the reduce load is appropriate for many jobs, it does not distribute the load evenly for jobs with significant data skew.

Data skew is an imbalance in the load assigned to different reduce tasks. The load is a function of:

  • The number of keys assigned to a reducer.

  • The number of records and the number of bytes in the values per key.

The total run time for a job is extended, to varying degrees, by the time that the reducer with the greatest load takes to finish. In jobs with a skewed load, some reducers complete the job quickly, while others take much longer. Perfect Balance can significantly shorten the total run time by distributing the load evenly, enabling all reducers to finish at about the same time.

Your MapReduce job can be written using either the mapred or mapreduce APIs; Perfect Balance supports both of them.

Perfect Balance was tested on MapReduce 1 (MRv1) CDH clusters, which is the default installation on Oracle Big Data Appliance.

4.1.1 About Balancing Jobs Across Map and Reduce Tasks

A typical Hadoop job has map and reduce tasks. Hadoop distributes the mapper workload uniformly across Hadoop Distributed File System (HDFS) and across map tasks while preserving the data locality. In this way, it reduces skew in the mappers.

Hadoop also hashes the map-output keys uniformly across all reducers. This strategy works well when there are many more keys than reducers, and each key represents a very small portion of the workload. However, it is not effective when the mapper output is concentrated into a small number of keys. Hashing these keys results in skew and does not work in applications like sorting, which require range partitioning.

Perfect Balance distributes the load evenly across reducers by first sampling the data, optionally chopping large keys into two or more smaller keys, and using a load-aware partitioning strategy to assign keys to reduce tasks.

4.1.2 Methods of Running Perfect Balance

You can choose from two methods of running Perfect Balance:

  • Perfect Balance Driver: You can run a job very easily using the hadoop command, without changing your application code.

  • Perfect Balance API: You can add the code to run Perfect Balance so that it runs after your application setup code.

Both methods are described in "Running a Balanced MapReduce Job."

4.1.3 Perfect Balance Components

Perfect Balance has these components:

  • Job Analyzer: Gathers and reports statistics about the MapReduce job so that you can determine whether to use Perfect Balance.

  • Counting Reducer: Provides additional statistics to help gauge the effectiveness of Perfect Balance.

  • Load Balancer: Runs before the MapReduce job to generate a static partition plan, and reconfigures the job to use the plan. The balancer includes a user-configurable, progressive sampler that stops sampling the data as soon as it can generate a good partitioning plan.

  • Driver: Enables you to run Perfect Balance in a hadoop command.

4.2 Getting Started with Perfect Balance

Take the following basic steps to use Perfect Balance:

  1. Ensure that your application meets the following requirements:

    • The job is distributive, so that splitting a group of records associated with a reduce key does not change the results.

      If it is not distributive, then you can still run Perfect Balance after disabling the key splitting feature. The job still benefits from using Perfect Balance, but the load is not as evenly balanced as when key splitting is in effect. See the oracle.hadoop.balancer.keyLoad.minChopBytes configuration property.

    • In this release, the mapper tasks cannot use symbolic links in the Hadoop distributed cache.

      As a general rule, a job that does not run successfully using a local job runner does not run successfully with Perfect Balance. For example, the local job runner in CDH 4.3 MRv1 clusters (like those configured on Oracle Big Data Appliance 2.2) does not support symbolic links in the mapper.

      To run a job with the local job runner, set the following configuration property:

      mapred.job.tracker=local
      
    • This release does not support combiners. The load balancing estimates may not be accurate when combiners are used.

  2. Log in to an Oracle Big Data Appliance server. You cannot run Perfect Balance from a remote Hadoop client.

  3. Run the examples provided with Perfect Balance to become familiar with the product. All examples shown in this chapter are based on the shipped examples and use the same data set. See "About the Perfect Balance Examples.".

  4. Set the following variables using the Bash export command:

    • HADOOP_CLASSPATH: Add ${BALANCER_HOME}/jlib/orabalancer.jar and ${BALANCER_HOME}/jlib/commons-math-2.2.jar to the existing value, if necessary. Also add the JAR files for your application.

    • HADOOP_USER_CLASSPATH_FIRST: Set to true so that Hadoop uses the Perfect Balance version of commons-math.jar instead of the default version.

    • BALANCER_HOME: Set to the Perfect Balance installation directory, which is /opt/oracle/orabalancer-1.0.0-h2 on Oracle Big Data Appliance (optional). The examples in this chapter use this variable, and you can also define it for your convenience. Perfect Balance does not use BALANCER_HOME.

  5. Run Job Analyzer without the balancer and use the generated report to decide whether the job is a good candidate for using Perfect Balance.

    See "Analyzing a Job for Imbalanced Reducer Loads."

  6. Choose a method of running Perfect Balance.

    See "Methods of Running Perfect Balance."

  7. Decide which configuration properties to set, if any. Create a configuration file or enter the settings individually in the hadoop command.

    See "About Configuring Perfect Balance."

  8. Run the job using Perfect Balance.

    See "Running a Balanced MapReduce Job."

  9. Use the Job Analyzer report to evaluate the effectiveness of using Perfect Balance. See "Reading the Job Analyzer Report."

  10. Modify the job configuration properties as desired before rerunning the job with Perfect Balance. See "About Configuring Perfect Balance."

4.3 About the Perfect Balance Examples

The Perfect Balance installation files include a full set of examples that you can run immediately. The InvertedIndex example is a MapReduce application that creates an inverted index on an input set of text files. The inverted index maps words to the location of the words in the text files. The input data is included.

4.3.1 About the Examples in this Chapter

The InvertedIndex example provides the basis for all examples in this chapter. They use the same data set and run the same MapReduce application. The modifications to the InvertedIndex example simply highlight the steps you must perform in running your own applications with Perfect Balance.

If you want to run the examples in this chapter, or use them as the basis for running your own jobs, then make the following changes:

  • If you are modifying the examples to run your own application, then add your application JAR files to HADOOP_CLASSPATH and -libjars.

  • Ensure that the value of mapred.input.dir identifies the location of your data.

    The invindx/input directory contains the sample data for the InvertedIndex example. To use this data, you must first set it up. See "Extracting the Example Data Set."

  • Replace jdoe with your Hadoop user name.

  • Set the -conf option to an existing configuration file.

    The jdoe_conf_invindx.xml file is a modification of the configuration file for the InvertedIndex examples. The modified file does not have performance optimizing settings. You can use the example configuration file as is or modify it. See /opt/oracle/orabalancer-1.0.0-h2/examples/invindx/conf_mapreduce.xml (or conf_mapred.xml).

  • Review the configuration settings in the file and in the shell script to ensure they are appropriate for your job.

  • You can run the browser from your laptop or connect to Oracle Big Data Appliance using a client that supports graphical interfaces, such as VNC.

4.3.2 Extracting the Example Data Set

To run the InvertedIndex examples or any of the examples in this chapter, you must first set up the data files.

To extract the InvertedIndex data files: 

  1. Log in to an Oracle Big Data Appliance server.

  2. Change to the examples/invindx subdirectory:

    cd /opt/oracle/orabalancer-1.0.0-h2/examples/invindx
    
  3. Unzip the data and copy it to the HDFS invindx/input directory:

    ./invindx -setup
    

For complete instructions for running the InvertedIndex example, see /opt/oracle/orabalancer-1.0.0-h2/examples/invindx/README.txt.

4.4 Analyzing a Job for Imbalanced Reducer Loads

Job Analyzer is a component of Perfect Balance that identifies imbalances in a load, and how effective Perfect Balance is in correcting the imbalance when actually running the job. This section contains the following topics:

4.4.1 About Job Analyzer

Job Analyzer uses the output logs of a MapReduce job to generate a simple report with statistics like the elapsed time and the load for each reduce task. By default, it uses the standard Hadoop counters displayed by the JobTracker user interface, but organizes the data to emphasize the relative performance and load of the reduce tasks, so that you can more easily interpret the results.

If the report shows that the reducers processed very different loads and the run times varied widely, then the application is a good candidate for Perfect Balance.

4.4.1.1 Methods of Running Job Analyzer

You can choose between two methods of running Job Analyzer:

  • As a standalone utility: Job Analyzer runs against existing job output logs. This is a good choice if you already ran your application on Oracle Big Data Appliance without using Perfect Balance.

  • Using Perfect Balance: Job Analyzer runs automatically against the output logs for the current job each time you use Perfect Balance. To see whether the job is a candidate for Perfect Balance, you can set a configuration property so that the job runs without the balancer.

4.4.2 Running Job Analyzer as a Standalone Utility

As a standalone utility, Job Analyzer provides a quick way to analyze the reduce load of a previously run job.

To run Job Analyzer as a standalone utility: 

  1. Log in to an Oracle Big Data Appliance server. You cannot run Job Analyzer from a remote Hadoop client.

  2. Locate the output logs from the job to analyze. Set mapred.output.dir to this directory.

  3. Run the Job Analyzer utility as described in "Job Analyzer Utility Syntax."

  4. View the Job Analyzer report in a browser.

4.4.2.1 Job Analyzer Utility Example

Example 4-1 runs a script that sets the required variables, uses the MapReduce job logs stored in jdoe_nobal_outdir, and creates the report in the default location. It then copies the HTML version of the report from HDFS to the /home/jdoe local directory and opens the report in a browser.

If you want to run this example, then replace jdoe_nobal_outdir with the HDFS path to the output logs from a previous job run. You can use the output logs from running the InvertedIndex example. The default HDFS output directory is in invindx/output. See "About the Perfect Balance Examples."

Example 4-1 Running the Job Analyzer Utility

$ cat runja.sh

BALANCER_HOME=/opt/oracle/orabalancer-1.0.0-h2
export HADOOP_CLASSPATH=${BALANCER_HOME}/jlib/orabalancer.jar:${BALANCER_HOME}/jlib/commons-math-2.2.jar:$HADOOP_CLASSPATH
export HADOOP_USER_CLASSPATH_FIRST=true

hadoop jar ${BALANCER_HOME}/jlib/orabalancer.jar oracle.hadoop.balancer.tools.JobAnalyzer \
-D mapred.output.dir=jdoe_nobal_outdir

$ sh ./runja.sh
$
$ hadoop fs -get jdoe_nobal_outdir/_balancer/jobanalyzer-report.html /home/jdoe
$ cd /home/jdoe
$ firefox jobanalyzer-report.html

4.4.2.2 Job Analyzer Utility Syntax

The following is the syntax to run the Job Analyzer utility:

bin/hadoop jar ${BALANCER_HOME}/jlib/orabalancer.jar oracle.hadoop.balancer.tools.JobAnalyzer \
-D mapred.output.dir=job_output_dir \
[ja_report_path]
job_output_dir

An HDFS directory where the job files are stored from previously executing the application.

ja_report_path

An HDFS directory where Job Analyzer creates its report (optional). The default directory is job_output_dir/_balancer.

4.4.3 Running Job Analyzer With the Perfect Balance Driver

Job Analyzer runs automatically each time you use Perfect Balance. If you are running your application on Oracle Big Data Appliance for the first time, then you may want to set up the job in Perfect Balance to run without the balancer, instead of using the Job Analyzer utility. The job runs using the default Hadoop partitioning strategy, and you can configure Job Analyzer to collect additional statistics. If the analysis shows that the load is skewed, then you can use the same basic setup in Perfect Balance for subsequent runs after turning the balancer on again.

To run Job Analyzer without the balancer: 

  1. Log in to an Oracle Big Data Appliance server. You cannot run Job Analyzer from a remote Hadoop client.

  2. To prevent the balancer from running, set the oracle.hadoop.balancer.driver.balance configuration property to false.

  3. To collect additional statistics and obtain recommendations for configuring future runs, set the oracle.hadoop.balancer.tools.useCountingReducer and oracle.hadoop.balancer.tools.printRecommendation configuration properties to true.

    See "Collecting Additional Metrics."

  4. Decide which additional configuration properties to set, if any.

    See "Job Analyzer Properties."

  5. Run the job using the Perfect Balance driver.

Note:

The Perfect Balance API does not use oracle.hadoop.balancer.driver.balance or oracle.hadoop.balancer.tools.useCountingReducer.

The job uses the default Hadoop distribution of the load to reduce tasks.

4.4.3.1 Job Analyzer Example

Example 4-2 runs a script that sets the required variables, uses the Perfect Balance driver to run a job without the balancer, and creates the report in the default location. It then copies the HTML version of the report from HDFS to the /home/jdoe local directory and opens the report in a browser.

If you want to run this example, then see "About the Examples in this Chapter." The output includes warnings, which you can ignore.

Example 4-2 Running Job Analyzer in Perfect Balance

$ cat ja_nobalance.sh

BALANCER_HOME=/opt/oracle/orabalancer-1.0.0-h2
export HADOOP_CLASSPATH=${BALANCER_HOME}/jlib/orabalancer.jar:${BALANCER_HOME}/jlib/commons-math-2.2.jar:$HADOOP_CLASSPATH
export HADOOP_USER_CLASSPATH_FIRST=true
 
hadoop jar ${BALANCER_HOME}/jlib/orabalancer.jar oracle.hadoop.balancer.BalancerDriver \
 -D mapred.input.dir=invindx/input \
 -D mapred.output.dir=jdoe_nobal_outdir \
 -D mapred.job.name=nobal \
 -D mapred.reduce.tasks=10 \
 -D oracle.hadoop.balancer.driver.balance=false \
 -D oracle.hadoop.balancer.tools.useCountingReducer=true \
 -D oracle.hadoop.balancer.tools.printRecommendation=true \
 -conf /home/jdoe/jdoe_conf_invindx.xml \
 -libjars ${BALANCER_HOME}/jlib/commons-math-2.2.jar,${BALANCER_HOME}/jlib/orabalancer.jar

]$ sh ja_nobalance.sh
13/06/26 16:07:12 INFO balancer.BalancerDriver: Submitting job
13/06/26 16:07:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/26 16:07:14 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/26 16:07:14 INFO input.FileInputFormat: Total input paths to process : 5
13/06/26 16:07:14 INFO mapred.JobClient: Running job: job_201305031125_0016
13/06/26 16:07:16 INFO mapred.JobClient:  map 0% reduce 0%
13/06/26 16:07:34 INFO mapred.JobClient:  map 5% reduce 0%
     .
     .
     .
13/06/26 16:11:06 INFO mapred.JobClient:     Reduce input records=20000000
13/06/26 16:11:06 INFO mapred.JobClient:     Reduce output records=13871794
13/06/26 16:11:06 INFO mapred.JobClient:     Spilled Records=60000000
13/06/26 16:11:06 INFO mapred.JobClient:     CPU time spent (ms)=184330
13/06/26 16:11:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=3421863936
13/06/26 16:11:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=10147790848
13/06/26 16:11:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=2158821376


$ hadoop fs -get jdoe_nobal_outdir/_balancer/jobanalyzer-report.html /home/jdoe
$ cd /home/jdoe
$ firefox jobanalyzer-report.html

4.4.3.2 Collecting Additional Metrics

If you set the oracle.hadoop.balancer.tools.useCountingReducer configuration property to true, then the Job Analyzer report includes the load metrics for each reduce key. This additional information provides a more detailed picture of the load for each reducer, with metrics that are not available in the standard Hadoop counters.

The Job Analyzer report also compares its predicted load with the actual load measured by CountingReducer. The difference between these values measures how effective Perfect Balance was in balancing the job.

If you also set oracle.hadoop.balancer.tools.printRecommendation to true, then Job Analyzer may recommend key load coefficients for the key load model. Use the recommended values to set the following configuration properties:

When you run the job again, these coefficients result in a more balanced job.

4.4.4 Reading the Job Analyzer Report

Job Analyzer writes its report in two formats: HTML for you, and XML for Perfect Balance. Copy the HTML version from HDFS to the local file system and open it in a browser, as shown in the previous examples.

When inspecting the Job Analyzer report, look for indicators of skew such as:

  • The execution time of some reducers is longer than others.

  • Some reducers process more records or bytes than others.

  • Some map output keys have more records than others.

  • Some map output records have more bytes than others.

Figure 4-1 shows a small portion of the analyzer report for the inverted index (invindx) example. It displays the key load coefficient recommendations, because this job ran with the appropriate configuration settings. See "Collecting Additional Metrics."

The task IDs are links to tables that show the analysis of specific tasks, which enables you to drill down for more details from the first, summary table.

This example uses an extremely small data set, but notice the differences between tasks 7 and 8: The input records range from 3% to 29%, and their corresponding elapsed times range from 9 to 15 seconds. This variation indicates skew. In contrast, tasks 0 and 4 have very different execution times, but process about the same amount of data. This variation is caused by some other factor, and not by skew.

Figure 4-1 Job Analyzer Report for Unbalanced Inverted Index Job

Description of Figure 4-1 follows
Description of "Figure 4-1 Job Analyzer Report for Unbalanced Inverted Index Job"

4.5 Running a Balanced MapReduce Job

You can use either the Perfect Balance driver or the API to run your application. See "Methods of Running Perfect Balance."

4.5.1 Using the Perfect Balance Driver

The following is the syntax to run a job using the Perfect Balance driver:

bin/hadoop jar ${BALANCER_HOME}/jlib/orabalancer.jar oracle.hadoop.balancer.BalancerDriver \
-D application_config_property \
-D perfect_balance_config_property \
-conf application_config_file.xml \
-conf perfect_balance_config_file.xml \
-libjars ${BALANCER_HOME}/jlib/commons-math-2.2.jar,${BALANCER_HOME}/jlib/orabalancer.jar,application_jar_path.jar...

You can include any generic hadoop command-line option. The driver implements the org.apache.hadoop.util.Tool interface and follows the standard Hadoop methods for building MapReduce applications. You can combine Perfect Balance properties and MapReduce properties in the same configuration file. See "About Configuring Perfect Balance."

Example 4-3 runs a script named pb_balance.sh, which uses the Perfect Balance driver to run the job. The key load metric properties are set to the values recommended in the Job Analyzer report shown in Figure 4-1.

Example 4-3 Running a Job Using the Perfect Balance Driver

$ cat runinvindx.sh

BALANCER_HOME=/opt/oracle/orabalancer-1.0.0-h2
export HADOOP_CLASSPATH=${BALANCER_HOME}/jlib/orabalancer.jar:${BALANCER_HOME}/jlib/commons-math-2.2.jar:$HADOOP_CLASSPATH
export HADOOP_USER_CLASSPATH_FIRST=true
 
 
hadoop jar ${BALANCER_HOME}/jlib/orabalancer.jar oracle.hadoop.balancer.BalancerDriver \
 -D mapred.input.dir=invindx/input \
 -D mapred.output.dir=jdoe_outdir \
 -D mapred.job.name=jdoe_invindx \
 -D mapred.reduce.tasks=10 \
 -D oracle.hadoop.balancer.tools.useCountingReducer=true \
 -D oracle.hadoop.balancer.tools.printRecommendation=true \
 -D oracle.hadoop.balancer.linearKeyLoad.keyWeight=107.35396359 \
 -D oracle.hadoop.balancer.linearKeyLoad.rowWeight=0.00148810 \
 -D oracle.hadoop.balancer.linearKeyLoad.byteWeight=0.0 \
 -conf /home/jdoe/jdoe_conf_invindx.xml \
 -libjars ${BALANCER_HOME}/jlib/commons-math-2.2.jar,${BALANCER_HOME}/jlib/orabalancer.jar

$ sh ./runinvindx.sh
13/06/27 09:01:53 INFO balancer.Balancer: Creating balancer
13/06/27 09:01:53 INFO input.FileInputFormat: Total input paths to process : 5
13/06/27 09:01:54 INFO balancer.Balancer: Starting Balancer
13/06/27 09:02:00 INFO balancer.Balancer: Balancer completed
13/06/27 09:02:00 INFO balancer.BalancerDriver: Submitting job
     .
     .
     .

4.5.2 Using the Perfect Balance API

The oracle.hadoop.balancer.Balancer class contains methods for creating a partitioning plan, saving the plan to a file, and running the MapReduce job using the plan. You only need to add the code to your Java class, not redesign the application. When you run a shell script to run the application, you can include Perfect Balance configuration settings.

4.5.2.1 Modifying Your Java Code to Use Perfect Balance

The Perfect Balance installation directory contains a complete example, including input data, of a Java MapReduce program that uses the Perfect Balance API.

For a description of the inverted index example and execution instructions, see orabalancer-1.0.0-h2/examples/invindx/README.txt.

To explore the modified Java code, see orabalancer-1.0.0-h2/examples/jsrc/oracle/hadoop/balancer/examples/invindx/InvertedIndexMapred.java or InvertedIndexMapreduce.java.

The modifications to run Perfect Balance include the following:

  • The oracle.hadoop.balancer.useMapreduceApi configuration property identifies the application's use of either the mapred or the mapreduce API.

  • The createBalancer method validates the configuration properties and returns a Balancer instance.

  • The waitForCompletion method samples the data and creates a partitioning plan.

  • The addBalancingPlan method adds the partitioning plan to the job configuration settings.

  • The configureCountingReducer method configures the application with a counting reducer that collects data to analyze the reducer load, if the oracle.hadoop.balancer.tools.useCountingReducer property is set to true.

  • The save method saves the partition report and generates the Job Analyzer report.

Example 4-4 shows fragments from the inverted index Java code.

Example 4-4 Running Perfect Balance in a MapReduce Job

     .
     .
     .
import oracle.hadoop.balancer.Balancer;
     .
     .
     .
///// BEGIN: CODE TO INVOKE BALANCER (PART-1, before job submission) //////
    Balancer balancer = null;
    
    // Specify the old mapred api
    job.setBoolean("oracle.hadoop.balancer.useMapreduceApi", false);
    
    boolean useBalancer = job.getBoolean(InvertedIndex.PROP_USE_BALANCER, 
        InvertedIndex.DEFAULT_USE_BALANCER);
    if(useBalancer)
    {
      balancer = Balancer.createBalancer(job);
      balancer.waitForCompletion();
      balancer.addBalancingPlan(job);
    }
    
    if(job.getBoolean("oracle.hadoop.balancer.tools.useCountingReducer", true))
    {
      Balancer.configureCountingReducer(job);
    }
    ////////////// END: CODE TO INVOKE BALANCER (PART-1) //////////////////////
    
    RunningJob rj = JobClient.runJob(job);
    
    ///////////////////////////////////////////////////////////////////////////
    // BEGIN: CODE TO INVOKE BALANCER (PART-2, after job completion, optional)
    // If balancer ran, this saves the partition file report into the _balancer
    // subdirectory of the job output directory. It also writes a JobAnalyzer
    // report.
    Balancer.save(rj, job);
    ////////////// END: CODE TO INVOKE BALANCER (PART-2) //////////////////////
     .
     .
     .
}

4.5.2.2 Running Your Modified Java Code with Perfect Balance

When you run your modified Java code, you can set the Perfect Balance properties, using the standard hadoop command syntax:

bin/hadoop jar application_jarfile.jar ApplicationClass \
-conf application_config.xml \
-conf perfect_balance_config.xml \
-D application_config_property \
-D perfect_balance_config_property \
-libjars /opt/oracle/orabalancer-1.0.0-h2/jlib/commons-math-2.2.jar,/opt/oracle/orabalancer-1.0.0-h2/jlib/orabalancer.jar,application_jar_path.jar...

Example 4-5 runs a script named pb_balanceapi.sh, which runs the InvertedIndexMapreduce class example packaged in the Perfect Balance JAR file. The key load metric properties are set to the values recommended in the Job Analyzer report shown in Figure 4-1.

To run the InvertedIndexMapreduce class example, see "About the Perfect Balance Examples."

Example 4-5 Running the InvertedIndexMapreduce Class

$ cat pb_balanceapi.sh
BALANCER_HOME=/opt/oracle/orabalancer-1.0.0-h2
APP_JAR_FILE=/opt/oracle/orabalancer-1.0.0-h2/jlib/orabalancer.jar
export HADOOP_CLASSPATH=${BALANCER_HOME}/jlib/orabalancer.jar:${BALANCER_HOME}/jlib/commons-math-2.2.jar:$HADOOP_CLASSPATH
export HADOOP_USER_CLASSPATH_FIRST=true

hadoop jar ${APP_JAR_FILE} oracle.hadoop.balancer.examples.invindx.InvertedIndexMapreduce \
 -D mapred.input.dir=invindx/input \
 -D mapred.output.dir=jdoe_outdir_api \
 -D mapred.job.name=jdoe_invindx_api \
 -D mapred.reduce.tasks=10 \
 -D oracle.hadoop.balancer.tools.useCountingReducer=true \
 -D oracle.hadoop.balancer.tools.printRecommendation=true \
 -D oracle.hadoop.balancer.linearKeyLoad.keyWeight=107.35396359 \
 -D oracle.hadoop.balancer.linearKeyLoad.rowWeight=0.00148810 \
 -D oracle.hadoop.balancer.linearKeyLoad.byteWeight=0.0 \
-libjars ${BALANCER_HOME}/jlib/commons-math-2.2.jar,${BALANCER_HOME}/jlib/orabalancer.jar

$ sh ./balanceapi.sh
13/06/24 17:21:47 INFO balancer.Balancer: Creating balancer
13/06/24 17:21:48 INFO input.FileInputFormat: Total input paths to process : 5
13/06/24 17:21:48 INFO balancer.Balancer: Starting Balancer
13/06/24 17:21:54 INFO balancer.Balancer: Balancer completed
     .
     .
     .

4.6 About Perfect Balance Reports

Perfect Balance generates these reports when it runs a job:

  • Job Analyzer report that contains various indicators about the distribution of the load in a job. The report is saved in HTML for you, and XML for Perfect Balance to use. The report is always named jobanalyzer-report.html and -.xml. See "Reading the Job Analyzer Report."

  • Partition report that identifies the keys that are assigned to the various mappers. This report is saved in XML for Perfect Balance to use; it does not contain information of use to you. The report is named ${mapred_output_dir}/_balancer/orabalancer_report.xml.

  • Reduce key metric reports that Perfect Balance generates for each file partition, when the appropriate configuration properties are set. The reports are saved in XML for Perfect Balance to use; they do not contain information of use to you. They are named ${mapred_output_dir}/_balancer/ReduceKeyMetricList-attempt_jobid_taskid_task_attemptid.xml.

    See "Collecting Additional Metrics."

The reports are stored by default in the job output directory (${mapred.output.dir}). Following is the structure of that directory:

job_output_directory
   /_SUCCESS
   /_balancer
      ReduceKeyMetricList-attempt_201305031125_0016_r_000000_0.xml
      ReduceKeyMetricList-attempt_201305031125_0016_r_000001_0.xml
        .
        .
        .
     jobanalyzer-report.html
     jobanalyzer-report.xml
   /_logs
      /history
         job_201305031125_0016_1372277234720_jdoe_invindx_nobal
         localhost.localdomain_1367594758596_job_201305031125_0016_conf.xml
   /part-r-00000
   /part-r-00001
     .
     .
     .

4.7 About Configuring Perfect Balance

Perfect Balance uses the standard Hadoop methods of specifying configuration properties in the command line. You can use the -conf option to identify a configuration file, or the -D option to specify individual properties. All Perfect Balance configuration properties have default values, and so setting them is optional.

"Perfect Balance Configuration Property Reference" lists the configuration properties in alphabetical order with a full description. The following are functional groups of properties.

BalancerDriver Properties 

Job Analyzer Properties 

Key Chopping Properties 

Load Balancing Properties 

Load Model Properties 

MapReduce-Related Properties 

Partition Report Properties 

Sampler Properties 

4.8 Perfect Balance Configuration Property Reference

This section describes the Perfect Balance configuration properties and a few generic Hadoop MapReduce properties that Perfect Balance reads from the job configuration

MapReduce Configuration Properties 

mapred.input.dir

Type: String

Default Value: Not defined

Description: A comma-separated list of input directories.

mapred.input.format.class

Type: String

Default Value: org.apache.hadoop.mapred.TextInputFormat

Description: The full name of the InputFormat class.

mapred.mapper.class

Type: String

Default Value: org.apache.hadoop.mapred.lib.IdentityMapper

Description: The full name of the mapper class.

mapred.output.dir

Type: String

Default Value: Not defined

Description: The job output directory.

mapred.ouput.format.class

Type: String

Default Value: org.apache.hadoop.mapred.TextOutputFormat

Description: The full name of the OutputFormat class.

mapred.partitioner.class

Type: String

Default Value: org.apache.hadoop.mapred.lib.HashPartitioner

Description: The full name of the partitioner class.

mapred.reducer.class

Type: String

Default Value: org.apache.hadoop.mapred.lib.IdentityReducer

Description: The full name of the reducer class.

mapreduce.inputformat.class

Type: String

Default Value: org.apache.hadoop.mapreduce.lib.input.TextInputFormat

Description: The full name of the InputFormat class.

mapreduce.map.class

Type: String

Default Value: org.apache.hadoop.mapreduce.Mapper

Description: The full name of the mapper class.

mapreduce.outputformat.class

Type: String

Default Value: org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

Description: The full name of the OutputFormat class.

mapreduce.partitioner.class

Type: String

Default Value: org.apache.hadoop.mapreduce.lib.partition.HashPartitioner

Description: The full name of the partitioner class.

mapreduce.reduce.class

Type: String

Default Value: org.apache.hadoop.mapreduce.Reducer

Description: The full name of the reducer class.

Perfect Balance Configuration Properties 

oracle.hadoop.balancer.confidence

Type: Float

Default Value: 0.95

Description: The statistical confidence indicator for the load factor specified by the oracle.hadoop.balancer.maxLoadFactor property.

This property accepts values greater than or equal to 0.5 and less than 1.0 (0.5 <= value < 1.0). A value less than 0.5 resets the property to its default value. Oracle recommends a value greater than or equal to 0.9. Typical values are 0.95 and 0.99.

oracle.hadoop.balancer.driver.balance

Type: Boolean

Default Value: true

Description: Controls whether the BalancerDriver class runs the balancer. Set to false to turn off balancing. The Perfect Balance driver uses this property; the API does not.

oracle.hadoop.balancer.enableSorting

Type: Boolean

Default Value: false

Description: Controls how the map output keys are chopped, that is, split into smaller keys:

  • false: Uses a hash function.

  • true: Uses the map output key sorting comparator as a total-order partitioning function. The balancer preserves the total order over the values of the chopped keys.

oracle.hadoop.balancer.inputFormat.mapred.map.tasks

Type: Integer

Default Value: 100

Description: Sets the Hadoop mapred.map.tasks property for the duration of sampling, just before calling the input format getSplits method. It does not change mapred.map.tasks for the actual job. The optimal number of map tasks is a trade-off between obtaining a good sample (larger number) and having finite memory resources (smaller number).

Set this property to a value greater than or equal to one (1). A value less than 1 disables the property.

Some input formats, such as DBInputFormat, use this property as a hint to determine the number of splits returned by getSplits. Higher values indicate that more chunks of data are sampled at random, which improves the sample.

You can increase the value for larger data sets, that is, more than a million rows of about 100 bytes per row. However, extremely large values can cause the input format's getSplits method to run out of memory by returning too many splits.

oracle.hadoop.balancer.inputFormat.mapred.max.split.size

Type: Long

Default Value: 1048576 (1 MB)

Description: Sets the Hadoop mapred.max.split.size property for the duration of sampling, just before calling the input format's getSplits method. It does not change mapred.max.split.size for the actual job.

Set this property to a value greater than or equal to one (1). A value less than 1 disables the property. The optimal split size is a trade-off between obtaining a good sample (smaller splits) and efficient I/O performance (larger splits).

Some input formats, such as FileInputFormat, use the maximum split size as a hint to determine the number of splits returned by getSplits. Smaller split sizes indicate that more chunks of data are sampled at random, which improves the sample. Set the value small enough for good sampling performance, but no smaller. Extremely small values can cause inefficient I/O performance, while not improving the sample.

You can increase the value for larger data sets (tens of terabytes) or if the input format's getSplits method throws an out of memory error. Large splits are better for I/O performance, but not for sampling.

oracle.hadoop.balancer.keyLoad.class

Type: Class

Default Value: oracle.hadoop.balancer.KeyLoadLinear

Description: The name of a class that implements the oracle.hadoop.balancer.KeyLoad interface.

oracle.hadoop.balancer.keyLoad.minChopBytes

Type: Long

Default Value: 0

Description: Controls whether Perfect Balance chops large map output keys into medium keys:

  • -1: Perfect Balance does not chop large map output keys.

  • 0: Perfect Balance chops large map output keys and determines the optimal size of each medium key.

  • Positive integer: Perfect Balance chops large map output keys into medium keys with a size greater than or equal to the specified integer.

oracle.hadoop.balancer.linearKeyLoad.byteWeight

Type: Float

Default Value: 0.05

Description: Weights the number of bytes per key in the linear key load model specified by the oracle.hadoop.balancer.KeyLoadLinear class.

oracle.hadoop.balancer.linearKeyLoad.keyWeight

Type: Float

Default Value: 50.0

Description: Weights the number of medium keys per large key in the linear key load model specified by the oracle.hadoop.balancer.KeyLoadLinear class.

oracle.hadoop.balancer.linearKeyLoad.rowWeight

Type: Float

Default Value: 0.05

Description: Weights the number of rows per key in the linear key load model specified by the oracle.hadoop.balancer.KeyLoadLinear class.

oracle.hadoop.balancer.maxLoadFactor

Type: Float

Default Value: 0.05

Description: The target reducer load factor that you want the balancer's partition plan to achieve.

The load factor is the relative deviation from an estimated value. For example, if maxLoadFactor=0.05 and confidence=0.95, then with a confidence greater than 95%, the job's reducer loads should be, at most, 5% greater than the value in the partition plan.

The values of these two properties determine the sampler's stopping condition. The balancer samples until it can generate a plan that guarantees the specified load factor at the specified confidence level. This guarantee may not hold if the sampler stops early because of other stopping conditions, such as the number of samples exceeds oracle.hadoop.balancer.maxSamplesPct. The partition report logs the stopping condition.

See oracle.hadoop.balancer.confidence.

oracle.hadoop.balancer.maxSamplesPct

Type: Float

Default Value: 0.01 (1%)

Description: Limits the number of samples that Perfect Balance can collect to a fraction of the total input records. A value less than zero disables the property (no limit).

You may need to increase the value for Hadoop applications with very unbalanced reducer partitions or densely clustered map-output keys. The sampler needs to sample more data to achieve a good partitioning plan in these cases.

See oracle.hadoop.balancer.useClusterStats.

oracle.hadoop.balancer.minSplits

Type: Integer

Default Value: 5

Description: Sets the minimum number of splits that the sampler reads. If the total number of splits is less than this value, then the sampler reads all splits. Set this property to a value greater than or equal to one (1). A nonpositive number sets the property to 1.

oracle.hadoop.balancer.numThreads

Type: Integer

Default Value: 5

Description: Number of sampler threads. Set this value based on the processor and memory resources available on the node where the job is initiated. A higher number of sampler threads implies higher concurrency in sampling. Set this property to one (1) to disable multithreading in the sampler.

oracle.hadoop.balancer.report.overwrite

Type: Boolean

Default Value: false

Description: Controls whether Perfect Balance overwrites files in the location specified by the oracle.hadoop.balancer.reportPath property. By default, Perfect Balance does not overwrite files; it throws an exception. Set this property to true to allow partition reports to be overwritten.

oracle.hadoop.balancer.reportPath

Type: String

Default Value: directory/orabalancer_report-random_unique_string.xml, where directory for HDFS is the home directory of the user who submits the job. For the local file system, it is the directory where the job is submitted.

Description: The path where Perfect Balance writes the partition report before the Hadoop job output directory is available, that is, before the MapReduce job finishes running. At the end of the job, Perfect Balance moves the file to job_output_dir/_balancer/orabalancer_report.xml. In the API, the save method does this task.

See oracle.hadoop.balancer.submitJob.

oracle.hadoop.balancer.submitJob

Type: Boolean

Default Value: true

Description: Controls whether the BalancerDriver class submits the Hadoop job for execution. Set to false to prevent the job from executing. In this case, the partition report is stored in the path identified by oracle.hadoop.balancer.reportPath.

oracle.hadoop.balancer.tmpDir

Type: String

Default Value: /tmp/orabalancer-user_name

Description: The path to a staging directory in the file system of the job output directory (HDFS or local). Perfect Balance creates the directory if it does not exist, and copies the partition report to it for loading into the Hadoop distributed cache.

oracle.hadoop.balancer.tools.jobConfPath

Type: String

Default Value: ${mapred.output.dir}/_logs/history

Description: The path to a Hadoop job configuration file. Job Analyzer uses this setting to locate the file.

oracle.hadoop.balancer.tools.jobHistoryPath

Type: String

Default Value: ${mapred.output.dir}/_logs/history

Description: The path to a Hadoop job history file. Job Analyzer uses this setting to locate the file.

oracle.hadoop.balancer.tools.printRecommendation

Type: Boolean

Default Value: false

Description: Controls whether Job Analyzer recommends values for the key load model properties, based on the elapsed time, input record, and input value byte statistics it gathers for each key. Job Analyzer does not print a recommendation in its report if it cannot make a confident recommendation. You can set the load model properties to the recommended values when running a balanced job. See "Load Model Properties."

oracle.hadoop.balancer.tools.useCountingReducer

Type: Boolean

Default Value: false

Description: Controls whether Job Analyzer can collect load statistics for each reduce task in a job. Set to true to allow statistics collection. The Perfect Balance driver uses this property; the API does not.

oracle.hadoop.balancer.tools.writeKeyBytes

Type: Boolean

Default Value: false

Description: Controls whether the counting reducer collects the byte representations of the reduce keys for the Job Analyzer. Set this property to true to represent the unique key values in base64 in the report. A string representation of the key, created using key.toString, is also provided in the report. This string value may not be unique for each key.

oracle.hadoop.balancer.useClusterStats

Type: Boolean

Default Value: true

Description: Enables the sampler to use cluster sampling statistics. These statistics improve the accuracy of sampled estimates, such as the number of records in a map-output key, when the map-output keys are distributed in clusters across input splits, instead of being distributed independently across all input splits.

Set this property to false only if you are absolutely certain that the map-output keys are not clustered. This setting improves the sampler's estimates only when there is, in fact, no clustering. Oracle recommends leaving this property set to true, because the distribution of map-output keys is usually unknown.

oracle.hadoop.balancer.useMapreduceApi

Type: Boolean

Default Value: true

Description: Identifies the MapReduce API used in the Hadoop job:

  • true: The job uses the mapreduce API.

  • false: The job uses the mapred API.