5.8.1 Selecting a Chopping Method

You can configure how Perfect Balance chops the values by setting the oracle.hadoop.balancer.choppingStrategy configuration property:

  • Chopping by hash partitioning: Set choppingStrategy=hash when sorting is not required. This is the default chopping strategy.

  • Chopping by round robin: Set choppingStrategy=roundRobin as an alternative strategy when total-order chopping is not required. If the load for a hash chopped key is unbalanced among reducers, try to use this chopping strategy.

  • Chopping by total-order partitioning: Set choppingStrategy=range to sort the values in each subpartition and order them across all subpartitions. In any parallel sort job, each task sort the rows within the task. The job must ensure that the values in reduce task 2 are greater than values in reduce task 1, the values in reduce task 3 are greater than the values in reduce task 2, and so on. The job generates multiple files containing data in sorted order, instead of one large file with sorted data.

    For example, if a key is chopped into three subpartitions, and the subpartitions are sent to reducers 5, 8 and 9, then the values for that key in reducer 9 are greater than all values for that key in reducer 8, and the values for that key in reducer 8 are greater than all values for that key in reducer 5. When choppingStrategy=range, Perfect Balance ensures this ordering across reduce tasks.

If an application requires that the data is aggregated across files, then you can disable chopping by setting oracle.hadoop.balancer.keyLoad.minChopBytes=-1. Perfect Balance still offers performance gains by combining smaller reduce keys, called bin packing.