MapReduce is based on two operations found in many functional programming languages: map and reduce (also called fold).
The Map function takes a {key, value} pair, performs some computation on it, then outputs a list of {key, value} pairs.
Map: (key, value) -> (key, value)[]
The Reduce function takes a {key, value[]} pair (a key together with all of the values collected for it), performs some computation on them, then outputs a {key, value}[], i.e., a list of pairs.
Reduce: (key, value[]) -> (key, value)[]
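To make the two signatures concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all: the class name `MapReduceSketch`, the word-count task, and the `run` helper that stands in for the framework's grouping step are all assumptions chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the MapReduce model (no Hadoop involved).
public class MapReduceSketch {

    // Map: (key, value) -> list of (key, value).
    // Here: (line number, line text) -> one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(Integer lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce: (key, value[]) -> list of (key, value).
    // Here: (word, [1, 1, ...]) -> a single (word, count) pair.
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(word, sum));
        return out;
    }

    // Stand-in for the framework: call map on every record, group the
    // intermediate values by key (sorted), then call reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines.get(i))) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(g.getKey(), g.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog", "the end");
        // prints {brown=1, dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(run(lines));
    }
}
```

Only `map` and `reduce` carry task-specific logic here; the grouping in `run` is the part a framework such as Hadoop would handle for you, distributed across machines.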
Writing a MapReduce program is conceptually as simple as writing the Map and Reduce functions (or using already-provided functions), then telling a job object which functions you want to use.
Of course, a real Hadoop program involves a fair amount of additional code (largely boilerplate), so a good way to jump in is to look at some existing Hadoop programs.
The following job simply reads in MapReduce-generated key/value pairs and outputs them as-is. The code itself is fairly straightforward and mostly consists of defining the Mapper and Reducer classes and then telling the job to use them. Read through it and make sure you understand the general flow, even if you aren't yet familiar with the actual classes involved.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Trivial MapReduce job that pipes input to output as MapReduce-created key-value pairs.
 */
public class Trivial extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Trivial(), args);
        System.exit(res);
    }

    // the "real" main method, invoked in a slightly roundabout way
    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Error: Wrong number of parameters");
            System.err.println("Expected: [in] [out]");
            System.exit(1);
        }

        Configuration conf = getConf();

        Job job = Job.getInstance(conf, "trivial job");
        job.setJarByClass(Trivial.class);

        // set the Mapper and Reducer functions we want
        job.setMapperClass(Trivial.IdentityMapper.class);
        job.setReducerClass(Trivial.IdentityReducer.class);

        // input arguments tell us where to get/put things in HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // ternary operator - a compact conditional
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * map: (LongWritable, Text) --> (LongWritable, Text)
     * NOTE: Keys must implement WritableComparable, values must implement Writable
     */
    public static class IdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable key, Text val, Context context)
                throws IOException, InterruptedException {
            // write (key, val) out to memory/disk
            context.write(key, val);
        }
    }

    /**
     * reduce: (LongWritable, Text) --> (LongWritable, Text)
     */
    public static class IdentityReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // write (key, val) for every value
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }
}
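It can help to trace which key/value pairs a job like this handles. The sketch below simulates the data flow in plain Java, without Hadoop. It rests on two assumptions about Hadoop's default text formats, which the job above never sets explicitly: each input line arrives as a (byte offset of the line, line text) pair, and each output pair is written as the key, a tab, then the value. The class `IdentityJobSketch` and its `run` method are hypothetical names, and the sketch assumes a single input file with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of an identity job's data flow (no Hadoop involved).
public class IdentityJobSketch {

    static String run(String fileContents) {
        // Input step (assumed): key = byte offset of the line, value = the line.
        // The TreeMap stands in for the shuffle, which groups and sorts by key.
        TreeMap<Long, List<String>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            // identity map: emit (key, value) unchanged
            grouped.computeIfAbsent(offset, k -> new ArrayList<>()).add(line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }

        // identity reduce: emit (key, value) for every value of every key,
        // written as key, tab, value (the assumed text output layout).
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Long, List<String>> e : grouped.entrySet()) {
            for (String val : e.getValue()) {
                out.append(e.getKey()).append('\t').append(val).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints three lines: "0<TAB>hello world", "12<TAB>foo", "16<TAB>bar baz"
        System.out.print(run("hello world\nfoo\nbar baz\n"));
    }
}
```

Under these assumptions the values pass through untouched, but every output line also carries its key, and the lines come out ordered by that key.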
These instructions assume you have already (1) configured your Hadoop cluster, (2) formatted HDFS, and (3) started the cluster.
The trivial job above and a Makefile to compile it are provided in your starter repository in the trivial directory. Build the program by running make in that directory. Doing so should produce a file called build.jar, which is your compiled Hadoop job. Now, you can launch your Hadoop job (from your master) by running the following:
# /usr/local/hadoop/bin/hadoop jar build.jar Trivial /input-file /output-dir
For this to work, you must have already copied /input-file to HDFS (remember, input and output paths are HDFS paths). You can use any file as the input file (e.g., Trivial.java). Hadoop will report the progress of your job on standard output. Once the job finishes, the output will be available under /output-dir inside HDFS. If you view the resulting part-r-00000 file, you should see the lines of your original input file, each preceded by its key (with the default text input format, the byte offset of that line within the input) and a tab.
Here is some example output from a run of this job:
19/03/28 07:34:19 INFO mapreduce.Job:  map 77% reduce 0%
19/03/28 07:34:20 INFO mapreduce.Job:  map 93% reduce 0%
19/03/28 07:34:21 INFO mapreduce.Job:  map 97% reduce 0%
19/03/28 07:34:24 INFO mapreduce.Job:  map 100% reduce 0%
19/03/28 07:34:25 INFO mapreduce.Job:  map 100% reduce 100%
19/03/28 07:34:25 INFO mapreduce.Job: Job job_1553754551422_0004 completed successfully
19/03/28 07:34:25 INFO mapreduce.Job: Counters: 52
    File System Counters
        FILE: Number of bytes read=115852
        FILE: Number of bytes written=6363395
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=95160
        HDFS: Number of bytes written=102975
        HDFS: Number of read operations=93
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Failed map tasks=2
        Killed map tasks=1
        Launched map tasks=33
        Launched reduce tasks=1
        Other local map tasks=2
        Data-local map tasks=31
        Total time spent by all maps in occupied slots (ms)=868458
        Total time spent by all reduces in occupied slots (ms)=41226
        Total time spent by all map tasks (ms)=434229
        Total time spent by all reduce tasks (ms)=20613
        Total vcore-milliseconds taken by all map tasks=434229
        Total vcore-milliseconds taken by all reduce tasks=20613
        Total megabyte-milliseconds taken by all map tasks=111162624
        Total megabyte-milliseconds taken by all reduce tasks=5276928
    Map-Reduce Framework
        Map input records=2423
        Map output records=2423
        Map output bytes=110993
        Map output materialized bytes=116026
        Input split bytes=3384
        Combine input records=0
        Combine output records=0
        Reduce input groups=1850
        Reduce shuffle bytes=116026
        Reduce input records=2423
        Reduce output records=2423
        Spilled Records=4846
        Shuffled Maps =30
        Failed Shuffles=0
        Merged Map outputs=30
        GC time elapsed (ms)=8832
        CPU time spent (ms)=18280
        Physical memory (bytes) snapshot=7933607936
        Virtual memory (bytes) snapshot=67026501632
        Total committed heap usage (bytes)=5230821376
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=91776
    File Output Format Counters
        Bytes Written=102975
Once you have gotten this working and are comfortable compiling and running a Hadoop job, you are ready to begin writing your own jobs for Project 3.