Implementing the MapReduce Application

Follow these steps to implement the WordCount MapReduce application; a small standalone sketch of the map and reduce logic follows the steps.

  1. Create src\main\java\org\apache\hadoop\examples\WordCount.java. The file name must match the public class name, WordCount.
  2. Copy and paste the following Java code into the new file.
    package org.apache.hadoop.examples;
     
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
     
    /**
     * Simple word count application that defines Mapper and Reducer classes
     * to count the words in a given input file.
     */
    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // map is called once for every line in the input file
            @Override
            public void map(Object key, Text value, Context context
                            ) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());

                    // emit a count of 1 for every word read from the input line
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            // reduce is called with the list of values emitted by map for each key
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context) throws IOException, InterruptedException {

                // aggregate the per-word counts and write the result to the context
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        /**
         * Driver code that configures the job and provides the Mapper and
         * Reducer classes to the Hadoop MapReduce framework.
         */
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
  3. Save and close the file.
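
To make the data flow concrete, here is a minimal standalone sketch of what the job computes: tokenize each line (the map phase), then sum the counts per word (the reduce phase). The class below is hypothetical and not part of the project; it needs no Hadoop dependencies and can be run directly.

    // Standalone illustration only; not part of the MapReduce project.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    public class WordCountSketch {
        public static void main(String[] args) {
            String[] lines = { "zeal zenith", "zelus zenith" };
            Map<String, Integer> counts = new HashMap<>();
            for (String line : lines) {
                // Map phase: split each line into tokens, conceptually emitting (word, 1).
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    // Reduce phase: sum the 1s emitted for each distinct word.
                    counts.merge(itr.nextToken(), 1, Integer::sum);
                }
            }
            // Prints zeal 1, zelus 1, zenith 2 (order may vary).
            counts.forEach((w, c) -> System.out.println(w + "\t" + c));
        }
    }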

Building the Application JAR

Run mvn clean package from the project root folder to build the application JAR in the target folder, for example wordcountjava-1.0-SNAPSHOT.jar.

mvn clean package

This command cleans any previous build artifacts, downloads any dependencies that haven't already been installed, and then builds and packages the application.

After the command finishes, the wordcountjava/target directory contains a file named wordcountjava-1.0-SNAPSHOT.jar.

Note

The wordcountjava-1.0-SNAPSHOT.jar file is an uberjar, which contains not only the WordCount job, but also dependencies that the job requires at runtime.
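
Producing the uber JAR depends on the project's pom.xml. If the project uses the maven-shade-plugin, as the note at the end of this article assumes, a minimal plugin configuration sketch looks like the following (the version number is illustrative):

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.4.1</version>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
            </execution>
        </executions>
    </plugin>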

Running the MapReduce Application JAR

Copy the JAR to any cluster node, then submit the job with the yarn jar command (a staging sketch follows the parameter list). For example:

yarn jar /tmp/wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/input.txt /example/data/wordcountout

Parameters:

  • /example/data/input.txt: HDFS path for the input text file
  • /example/data/wordcountout: HDFS path for the job output (written by the reducer)
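
If the JAR was built on a workstation and the input file isn't in HDFS yet, stage both before submitting the job. A sketch, assuming SSH access to a cluster node (the user name, host name, and local file name are hypothetical):

    scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@headnode:/tmp/
    hdfs dfs -mkdir -p /example/data
    hdfs dfs -put input.txt /example/data/input.txt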

Sample output:

hdfs dfs -cat /example/data/wordcountout/*

          zeal    1
          zelus   1
          zenith  2

Note

The wordcountjava-1.0-SNAPSHOT.jar file is a fat JAR if it is built through the maven-shade-plugin. Alternatively, use the -libjars option to provide additional JARs on the job classpath. On a secure cluster, kinit and appropriate Ranger policies are required before submitting the job.
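
For example, to authenticate with Kerberos and pass an extra dependency JAR on the classpath (the principal name and extra JAR path below are hypothetical; -libjars is honored here because the driver parses arguments with GenericOptionsParser):

    kinit analyst@EXAMPLE.COM
    yarn jar /tmp/wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount -libjars /tmp/extra-dependency.jar /example/data/input.txt /example/data/wordcountout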