CSCI 3325 Spring 2015 Instructor: Sean Barker
MapReduce is based on two higher-order functions that are standard in many functional programming languages: map and reduce.
The Map function takes a single {key, value} pair, performs some computation on it, then outputs a list of {key, value} pairs.
Map: (key, value) -> (key, value)[]
The Reduce function takes a key together with the list of all values grouped under that key, performs some computation on them, then outputs a list of {key, value} pairs.
Reduce: (key, value[]) -> (key, value)[]
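To make these two signatures concrete, here is a hypothetical plain-Java sketch (no Hadoop required) of the canonical MapReduce example, word counting: map emits a {word, 1} pair per word, a simulated shuffle groups the values by key, and reduce sums each group. The class and method names are illustrative only and do not appear in the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map: (key, value) -> (key, value)[]
    // The input key (e.g., a line offset) is ignored here; one {word, 1}
    // pair is emitted per word in the line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce: (key, value[]) -> (key, value)[]
    // Sums all the counts grouped under one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = {"the quick brown fox", "the lazy dog"};
        // Simulated "shuffle" phase: group all mapped values by key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.length; i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines[i])) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            System.out.println(g.getKey() + "\t" + reduce(g.getKey(), g.getValue()));
        }
    }
}
```

In a real Hadoop job the grouping step is performed by the framework between the map and reduce phases; only the two functions themselves are written by the programmer.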
In a Hadoop job, calling context.write(key, value); is how Map and Reduce functions emit their {key, value} pairs.
Writing a MapReduce is as simple as writing the Map and Reduce functions (or using already-provided functions), then telling a job object which functions you want to use.
The following job simply reads in its input as key-value pairs and outputs them unchanged. The code itself is fairly straightforward and mostly consists of defining the Mapper and Reducer classes and then telling the job to use them.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Trivial MapReduce job that pipes input to output as MapReduce-created
 * key-value pairs.
 */
public class Trivial extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Trivial(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Error: Wrong number of parameters");
            System.err.println("Expected: [in] [out]");
            System.exit(1);
        }

        Configuration conf = getConf();
        Job job = new Job(conf, "trivial job");
        job.setJarByClass(Trivial.class);
        job.setMapperClass(Trivial.IdentityMapper.class);
        job.setReducerClass(Trivial.IdentityReducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * map: (LongWritable, Text) --> (LongWritable, Text)
     * NOTE: Keys must implement WritableComparable, values must implement Writable
     */
    public static class IdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        public void map(LongWritable key, Text val, Context context)
                throws IOException, InterruptedException {
            // write (key, val) out to memory/disk
            context.write(key, val);
        }
    }

    /**
     * reduce: (LongWritable, Text) --> (LongWritable, Text)
     */
    public static class IdentityReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // write (key, val) for every value
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }
}
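With the default TextInputFormat, the keys handed to map are byte offsets into the input file and each value is one line of text, which is why the Mapper above is typed (LongWritable, Text). As a rough illustration under that assumption, the following hypothetical plain-Java sketch (no Hadoop required) reconstructs the {offset, line} records that the identity job passes through unchanged, printed key-tab-value:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdentitySketch {

    // Mimics TextInputFormat: split the input into lines, each keyed by
    // the byte offset at which it starts (assuming '\n' line endings and
    // one byte per character).
    static Map<Long, String> splitIntoRecords(String input) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the newline
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "hello world\nsecond line\n";
        // The identity map and reduce leave each record unchanged; each is
        // printed here as key, tab, value.
        for (Map.Entry<Long, String> kv : splitIntoRecords(input).entrySet()) {
            System.out.println(kv.getKey() + "\t" + kv.getValue());
        }
    }
}
```

For the two-line input above, this prints 0 and 12 as the keys, matching the offsets at which the two lines begin in the file.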
These instructions assume you have already (1) installed the Hadoop cluster, (2) formatted your HDFS, and (3) started Hadoop via start-dfs.sh and start-mapred.sh.
Make a directory for the MapReduce example. Save the trivial MapReduce program into Trivial.java and the following Makefile into that directory as well:
PATH := ${JAVA_HOME}/bin:${PATH}
HADOOP_PATH = /usr/local/hadoop
NEW_CLASSPATH = ${HADOOP_PATH}/*:${CLASSPATH}
SRC = $(wildcard *.java)

all: build

build: ${SRC}
	${JAVA_HOME}/bin/javac -Xlint -classpath ${NEW_CLASSPATH} ${SRC}
	${JAVA_HOME}/bin/jar cvf build.jar *.class
Now cd into that directory and build the program by running make. This should produce a file called build.jar, which is your compiled Hadoop job.
Now, you can launch your Hadoop job (from your master) by running:
/usr/local/hadoop/bin/hadoop jar build.jar Trivial input-file output-dir
For this to work, you must have already copied input-file to HDFS (remember, input and output paths are HDFS paths) using hadoop dfs -put localfile remotefile. You can use any file as the input file (e.g., Trivial.java). Hadoop should report the progress of your job on standard output. Once the job finishes, the output will be available under output-dir inside your HDFS, so you will need to use hadoop dfs -get or hadoop dfs -cat to view it. If you cat part-r-00000, you should see every line of your original input-file, each prefixed by its byte offset (the LongWritable key) and a tab, since the default output format writes each pair as key, tab, value.
Here is some example output from a correct run of this job:
15/03/31 11:02:16 INFO input.FileInputFormat: Total input paths to process : 1
15/03/31 11:02:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/31 11:02:16 WARN snappy.LoadSnappy: Snappy native library not loaded
15/03/31 11:02:17 INFO mapred.JobClient: Running job: job_201503311018_0008
15/03/31 11:02:18 INFO mapred.JobClient:  map 0% reduce 0%
15/03/31 11:02:35 INFO mapred.JobClient:  map 100% reduce 0%
15/03/31 11:02:48 INFO mapred.JobClient:  map 100% reduce 33%
15/03/31 11:02:51 INFO mapred.JobClient:  map 100% reduce 100%
15/03/31 11:02:57 INFO mapred.JobClient: Job complete: job_201503311018_0008
15/03/31 11:02:57 INFO mapred.JobClient: Counters: 29
15/03/31 11:02:57 INFO mapred.JobClient:   Job Counters
15/03/31 11:02:57 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/31 11:02:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22708
15/03/31 11:02:57 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/31 11:02:57 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/31 11:02:57 INFO mapred.JobClient:     Rack-local map tasks=1
15/03/31 11:02:57 INFO mapred.JobClient:     Launched map tasks=1
15/03/31 11:02:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15371
15/03/31 11:02:57 INFO mapred.JobClient:   File Output Format Counters
15/03/31 11:02:57 INFO mapred.JobClient:     Bytes Written=5293
15/03/31 11:02:57 INFO mapred.JobClient:   FileSystemCounters
15/03/31 11:02:57 INFO mapred.JobClient:     FILE_BYTES_READ=6120
15/03/31 11:02:57 INFO mapred.JobClient:     HDFS_BYTES_READ=4655
15/03/31 11:02:57 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=126863
15/03/31 11:02:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=5293
15/03/31 11:02:57 INFO mapred.JobClient:   File Input Format Counters
15/03/31 11:02:57 INFO mapred.JobClient:     Bytes Read=4544
15/03/31 11:02:57 INFO mapred.JobClient:   Map-Reduce Framework
15/03/31 11:02:57 INFO mapred.JobClient:     Map output materialized bytes=6120
15/03/31 11:02:57 INFO mapred.JobClient:     Map input records=157
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce shuffle bytes=6120
15/03/31 11:02:57 INFO mapred.JobClient:     Spilled Records=314
15/03/31 11:02:57 INFO mapred.JobClient:     Map output bytes=5800
15/03/31 11:02:57 INFO mapred.JobClient:     Total committed heap usage (bytes)=152244224
15/03/31 11:02:57 INFO mapred.JobClient:     CPU time spent (ms)=3210
15/03/31 11:02:57 INFO mapred.JobClient:     Combine input records=0
15/03/31 11:02:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=111
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce input records=157
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce input groups=157
15/03/31 11:02:57 INFO mapred.JobClient:     Combine output records=0
15/03/31 11:02:57 INFO mapred.JobClient:     Physical memory (bytes) snapshot=250531840
15/03/31 11:02:57 INFO mapred.JobClient:     Reduce output records=157
15/03/31 11:02:57 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1663946752
15/03/31 11:02:57 INFO mapred.JobClient:     Map output records=157