MapReduce is based on two standard features in many functional programming languages.
The Map function takes a single {key, value} pair, performs some computation on it, and outputs a list of {key, value} pairs.
Map: (key, value) -> (key, value)[]
The Reduce function takes a key together with a list of values, {key, value[]}, performs some computation on them, and outputs a list of {key, value} pairs.
Reduce: (key, value[]) -> (key, value)[]
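To make these signatures concrete, consider the classic word-count example (not part of the provided code, just an illustration): Map emits one {word, 1} pair for every word in its input line, and Reduce sums the list of counts it receives for each word. A sketch of what those two functions might look like in Hadoop (class and variable names here are purely illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountSketch {
    // map: (LongWritable, Text) --> (Text, IntWritable)[]
    // one input line can produce many output pairs (one per word)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // emit (word, 1)
                }
            }
        }
    }

    // reduce: (Text, IntWritable[]) --> (Text, IntWritable)
    // collapses each word's list of counts into a single total
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }
}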
Writing a MapReduce program is conceptually as simple as writing the Map and Reduce functions (or using already-provided functions), then telling a job object which functions you want to use.
Of course, a real Hadoop program involves a fair amount of additional code (largely boilerplate), so a good way to jump in is to look at some existing Hadoop programs.
The following job simply reads in the key/value pairs that the MapReduce framework generates from its input and writes them back out unchanged. The code itself is fairly straightforward: most of it consists of defining the Mapper and Reducer classes and then telling the job to use them. Read through it and make sure you understand the general flow, even if you aren't yet familiar with the actual classes involved.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
 * Trivial MapReduce job that pipes input to output as MapReduce-created key-value pairs.
 */
public class Trivial extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Trivial(), args);
        System.exit(res);
    }

    // the "real" main method, invoked in a slightly roundabout way
    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Error: Wrong number of parameters");
            System.err.println("Expected: [in] [out]");
            System.exit(1);
        }
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "trivial job");
        job.setJarByClass(Trivial.class);
        // set the Mapper and Reducer functions we want
        job.setMapperClass(Trivial.IdentityMapper.class);
        job.setReducerClass(Trivial.IdentityReducer.class);
        // input arguments tell us where to get/put things in HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ternary operator - a compact conditional
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * map: (LongWritable, Text) --> (LongWritable, Text)
     * NOTE: Keys must implement WritableComparable, values must implement Writable
     */
    public static class IdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        public void map(LongWritable key, Text val, Context context)
                throws IOException, InterruptedException {
            // write (key, val) out to memory/disk
            context.write(key, val);
        }
    }

    /**
     * reduce: (LongWritable, Text) --> (LongWritable, Text)
     */
    public static class IdentityReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // write (key, val) for every value
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }
}
These instructions assume you have already (1) configured your Hadoop cluster, (2) formatted HDFS, and (3) started the cluster.
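For reference, on a typical installation rooted at /usr/local/hadoop (the same path used in the command below), the last two of those steps usually look something like the following; the exact scripts and paths may differ on your setup:
# /usr/local/hadoop/bin/hdfs namenode -format
# /usr/local/hadoop/sbin/start-dfs.sh
# /usr/local/hadoop/sbin/start-yarn.sh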
The trivial job above and a Makefile to compile it are provided in your starter repository in the trivial directory. Build the program by running make in that directory. Doing so should produce a file called build.jar, which is your compiled Hadoop job. Now, you can launch your Hadoop job (from your master) by running the following:
# /usr/local/hadoop/bin/hadoop jar build.jar Trivial /input-file /output-dir
For this to work, you must have already copied /input-file to HDFS (remember, input and output paths are HDFS paths). You can use any file as the input file (e.g., Trivial.java). Hadoop will report the progress of your job on standard output. Once the job finishes, the output will be available under /output-dir inside HDFS. If you view the resulting part-r-00000 file, you should see the contents of your original input file, with each line preceded by its byte offset (the LongWritable key) and a tab, since the job writes out both the key and the value.
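For example, to stage a copy of Trivial.java as the input before launching the job, and to inspect the result after it completes (the HDFS paths here just mirror the ones above):
# /usr/local/hadoop/bin/hadoop fs -put Trivial.java /input-file
# /usr/local/hadoop/bin/hadoop fs -cat /output-dir/part-r-00000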
Here is some example output from a run of this job:
19/03/28 07:34:19 INFO mapreduce.Job: map 77% reduce 0%
19/03/28 07:34:20 INFO mapreduce.Job: map 93% reduce 0%
19/03/28 07:34:21 INFO mapreduce.Job: map 97% reduce 0%
19/03/28 07:34:24 INFO mapreduce.Job: map 100% reduce 0%
19/03/28 07:34:25 INFO mapreduce.Job: map 100% reduce 100%
19/03/28 07:34:25 INFO mapreduce.Job: Job job_1553754551422_0004 completed successfully
19/03/28 07:34:25 INFO mapreduce.Job: Counters: 52
    File System Counters
        FILE: Number of bytes read=115852
        FILE: Number of bytes written=6363395
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=95160
        HDFS: Number of bytes written=102975
        HDFS: Number of read operations=93
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Failed map tasks=2
        Killed map tasks=1
        Launched map tasks=33
        Launched reduce tasks=1
        Other local map tasks=2
        Data-local map tasks=31
        Total time spent by all maps in occupied slots (ms)=868458
        Total time spent by all reduces in occupied slots (ms)=41226
        Total time spent by all map tasks (ms)=434229
        Total time spent by all reduce tasks (ms)=20613
        Total vcore-milliseconds taken by all map tasks=434229
        Total vcore-milliseconds taken by all reduce tasks=20613
        Total megabyte-milliseconds taken by all map tasks=111162624
        Total megabyte-milliseconds taken by all reduce tasks=5276928
    Map-Reduce Framework
        Map input records=2423
        Map output records=2423
        Map output bytes=110993
        Map output materialized bytes=116026
        Input split bytes=3384
        Combine input records=0
        Combine output records=0
        Reduce input groups=1850
        Reduce shuffle bytes=116026
        Reduce input records=2423
        Reduce output records=2423
        Spilled Records=4846
        Shuffled Maps =30
        Failed Shuffles=0
        Merged Map outputs=30
        GC time elapsed (ms)=8832
        CPU time spent (ms)=18280
        Physical memory (bytes) snapshot=7933607936
        Virtual memory (bytes) snapshot=67026501632
        Total committed heap usage (bytes)=5230821376
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=91776
    File Output Format Counters
        Bytes Written=102975
Once you have gotten this working and are comfortable compiling and running a Hadoop job, you are ready to begin writing your own jobs for Project 3.
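One configuration detail to keep in mind as you do: the Trivial job gets away without declaring its output key/value types because LongWritable and Text happen to match Hadoop's defaults. If your own Mapper or Reducer emits different types, you will generally need to declare them in run() as well. A minimal sketch of the extra calls, assuming a hypothetical job that emits Text keys and IntWritable values:
// inside run(), after Job.getInstance(...), for a hypothetical job emitting (Text, IntWritable)
job.setMapOutputKeyClass(Text.class);           // key type the Mapper emits
job.setMapOutputValueClass(IntWritable.class);  // value type the Mapper emits
job.setOutputKeyClass(Text.class);              // key type the Reducer emits
job.setOutputValueClass(IntWritable.class);     // value type the Reducer emits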