
CSCI 3325
Distributed Systems

Bowdoin College
Spring 2015
Instructor: Sean Barker

MapReduce Example

MapReduce is based on two functions that are standard features of many functional programming languages.

The Map function takes a {key, value} pair, performs some computation on it, then outputs a list of {key, value} pairs.

   Map: (key, value) -> (key, value)[]

The Reduce function takes a key and a list of values {key, value[]}, performs some computation on them, then outputs a list of {key, value} pairs.

   Reduce: (key, value[]) -> (key, value)[]

In a Hadoop job (using the org.apache.hadoop.mapreduce API shown below), context.write(key, value); is how the Map and Reduce functions emit their {key, value} pairs.
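
To make these signatures concrete, here is a sketch of the classic word-count Mapper and Reducer written against the same Hadoop API. The WordCount class and its nested class names are illustrative only (they are not part of this handout's code), and the driver setup would look just like the Trivial job below.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  /**
   * map: (LongWritable byte offset, Text line) --> list of (Text word, IntWritable 1)
   */
  public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text val, Context context)
        throws IOException, InterruptedException {
      // emit (word, 1) for every whitespace-separated word on this line
      for (String token : val.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  /**
   * reduce: (Text word, [IntWritable 1, 1, ...]) --> (Text word, IntWritable count)
   */
  public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // sum the 1s emitted for this word
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}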

Writing a MapReduce job is as simple as writing the Map and Reduce functions (or using already-provided ones), then telling a job object which Mapper and Reducer classes to use.

Trivial Identity Job

The following job simply reads in the key-value pairs that MapReduce generates from the input and outputs them as-is. The code itself is fairly straightforward and mostly consists of defining the Mapper and Reducer classes and then telling the job to use them.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Trivial MapReduce job that pipes input to output as MapReduce-created key-value pairs.
 */
public class Trivial extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Trivial(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {

    if (args.length < 2) {
      System.err.println("Error: Wrong number of parameters");
      System.err.println("Expected: [in] [out]");
      System.exit(1);
    }

    Configuration conf = getConf();

    Job job = new Job(conf, "trivial job");
    job.setJarByClass(Trivial.class);

    job.setMapperClass(Trivial.IdentityMapper.class);
    job.setReducerClass(Trivial.IdentityReducer.class);
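    // Note: setOutputKeyClass/setOutputValueClass are not called here; Hadoop's
    // default output key/value classes (LongWritable/Text) already match what
    // IdentityMapper and IdentityReducer emit.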

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
  }

  /**
   * map: (LongWritable, Text) --> (LongWritable, Text)
   * NOTE: Keys must implement WritableComparable, values must implement Writable
   */
  public static class IdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text val, Context context)
        throws IOException, InterruptedException {
      // write (key, val) out to memory/disk
      context.write(key, val);
    }

  }

  /**
   * reduce: (LongWritable, Text) --> (LongWritable, Text)
   */
  public static class IdentityReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) 
        throws IOException, InterruptedException {
      // write (key, val) for every value
      for (Text val : values) {
        context.write(key, val);
      }
    }

  }

}

Compiling the Job

These instructions assume you have already (1) installed Hadoop on your cluster, (2) formatted your HDFS, and (3) started Hadoop via start-dfs.sh and start-mapred.sh.

Make a directory for the MapReduce example. Save the trivial MapReduce program above as Trivial.java in that directory, along with the following Makefile:

PATH:=${JAVA_HOME}/bin:${PATH}
HADOOP_PATH=/usr/local/hadoop
NEW_CLASSPATH=${HADOOP_PATH}/*:${CLASSPATH}

SRC = $(wildcard *.java) 

all: build

build: ${SRC}
  ${JAVA_HOME}/bin/javac -Xlint -classpath ${NEW_CLASSPATH} ${SRC}
  ${JAVA_HOME}/bin/jar cvf build.jar *.class 

Now cd into that directory and build the program by running make (note that the two command lines in the Makefile must begin with a tab character, as is standard for Makefile rules). This should produce a file called build.jar, which is your compiled Hadoop job.

Now, you can launch your Hadoop job (from your master) by running:

/usr/local/hadoop/bin/hadoop jar build.jar Trivial input-file output-dir

For this to work, you must have already copied input-file into HDFS (remember, input and output paths are HDFS paths) using hadoop dfs -put localfile remotefile. You can use any file as the input file (e.g., Trivial.java). Hadoop should report the progress of your job on standard output. Once the job finishes, the output will be available under output-dir inside your HDFS, so you will need to use hadoop dfs -get or hadoop dfs -cat to view it. If you cat part-r-00000, you should see the contents of your original input-file, with each line prefixed by its key (the line's starting byte offset) and a tab, since the default output format writes each key-value pair as key, tab, value.
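
For example, assuming you use Trivial.java itself as the input and these (illustrative) HDFS path names, the full sequence run from the master would look something like:

/usr/local/hadoop/bin/hadoop dfs -put Trivial.java input-file
/usr/local/hadoop/bin/hadoop jar build.jar Trivial input-file output-dir
/usr/local/hadoop/bin/hadoop dfs -cat output-dir/part-r-00000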

Here is some example output from a correct run of this job:

15/03/31 11:02:16 INFO input.FileInputFormat: Total input paths to process : 1
15/03/31 11:02:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/31 11:02:16 WARN snappy.LoadSnappy: Snappy native library not loaded
15/03/31 11:02:17 INFO mapred.JobClient: Running job: job_201503311018_0008
15/03/31 11:02:18 INFO mapred.JobClient:  map 0% reduce 0%
15/03/31 11:02:35 INFO mapred.JobClient:  map 100% reduce 0%
15/03/31 11:02:48 INFO mapred.JobClient:  map 100% reduce 33%
15/03/31 11:02:51 INFO mapred.JobClient:  map 100% reduce 100%
15/03/31 11:02:57 INFO mapred.JobClient: Job complete: job_201503311018_0008
15/03/31 11:02:57 INFO mapred.JobClient: Counters: 29
15/03/31 11:02:57 INFO mapred.JobClient:   Job Counters 
15/03/31 11:02:57 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/31 11:02:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22708
15/03/31 11:02:57 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/31 11:02:57 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/31 11:02:57 INFO mapred.JobClient:     Rack-local map tasks=1
15/03/31 11:02:57 INFO mapred.JobClient:     Launched map tasks=1
15/03/31 11:02:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15371
15/03/31 11:02:57 INFO mapred.JobClient:   File Output Format Counters 
15/03/31 11:02:57 INFO mapred.JobClient:     Bytes Written=5293
15/03/31 11:02:57 INFO mapred.JobClient:   FileSystemCounters
15/03/31 11:02:57 INFO mapred.JobClient:     FILE_BYTES_READ=6120
15/03/31 11:02:57 INFO mapred.JobClient:     HDFS_BYTES_READ=4655
15/03/31 11:02:57 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=126863
15/03/31 11:02:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=5293
15/03/31 11:02:57 INFO mapred.JobClient:   File Input Format Counters 
15/03/31 11:02:57 INFO mapred.JobClient:     Bytes Read=4544
15/03/31 11:02:57 INFO mapred.JobClient:   Map-Reduce Framework
15/03/31 11:02:57 INFO mapred.JobClient:     Map output materialized bytes=6120
15/03/31 11:02:57 INFO mapred.JobClient:     Map input records=157
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce shuffle bytes=6120
15/03/31 11:02:57 INFO mapred.JobClient:     Spilled Records=314
15/03/31 11:02:57 INFO mapred.JobClient:     Map output bytes=5800
15/03/31 11:02:57 INFO mapred.JobClient:     Total committed heap usage (bytes)=152244224
15/03/31 11:02:57 INFO mapred.JobClient:     CPU time spent (ms)=3210
15/03/31 11:02:57 INFO mapred.JobClient:     Combine input records=0
15/03/31 11:02:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=111
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce input records=157
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce input groups=157
15/03/31 11:02:57 INFO mapred.JobClient:     Combine output records=0
15/03/31 11:02:57 INFO mapred.JobClient:     Physical memory (bytes) snapshot=250531840
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce output records=157
15/03/31 11:02:57 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1663946752
15/03/31 11:02:57 INFO mapred.JobClient:     Map output records=157