MapReduce is based on two standard features in many functional programming languages.
The Map function takes a single {key, value} pair, performs some computation on it, and outputs a list of {key, value} pairs.
Map: (key, value) -> (key, value)[]
The Reduce function takes a key together with a list of values, {key, value[]}, performs some computation on them, and outputs a list of {key, value} pairs.
Reduce: (key, value[]) -> (key, value)[]
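To make these signatures concrete, consider the classic word-count example (not part of the provided code, just an illustration): Map emits one {word, 1} pair for every word in its input line, and Reduce sums the list of counts it receives for each word. A sketch of what those two functions might look like in Hadoop (class and variable names here are purely illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountSketch {
    // map: (LongWritable, Text) --> (Text, IntWritable)[]
    // one input line can produce many output pairs (one per word)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // emit (word, 1)
                }
            }
        }
    }

    // reduce: (Text, IntWritable[]) --> (Text, IntWritable)
    // collapses each word's list of counts into a single total
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }
}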
Writing a MapReduce program is conceptually as simple as writing the Map and Reduce functions (or using already-provided functions), then telling a job object which functions you want to use.
Of course, a real Hadoop program involves a fair amount of additional code (largely boilerplate), so a good way to jump in is to look at some existing Hadoop programs.
The following job simply reads in the key/value pairs that the MapReduce framework generates from its input and writes them back out unchanged. The code itself is fairly straightforward: most of it consists of defining the Mapper and Reducer classes and then telling the job to use them. Read through it and make sure you understand the general flow, even if you aren't yet familiar with the actual classes involved.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
 * Trivial MapReduce job that pipes input to output as MapReduce-created key-value pairs.
 */
public class Trivial extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Trivial(), args);
        System.exit(res);
    }

    // the "real" main method, invoked in a slightly roundabout way
    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Error: Wrong number of parameters");
            System.err.println("Expected: [in] [out]");
            System.exit(1);
        }
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "trivial job");
        job.setJarByClass(Trivial.class);
        // set the Mapper and Reducer functions we want
        job.setMapperClass(Trivial.IdentityMapper.class);
        job.setReducerClass(Trivial.IdentityReducer.class);
        // input arguments tell us where to get/put things in HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ternary operator - a compact conditional
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * map: (LongWritable, Text) --> (LongWritable, Text)
     * NOTE: Keys must implement WritableComparable, values must implement Writable
     */
    public static class IdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        public void map(LongWritable key, Text val, Context context)
                throws IOException, InterruptedException {
            // write (key, val) out to memory/disk
            context.write(key, val);
        }
    }

    /**
     * reduce: (LongWritable, Text) --> (LongWritable, Text)
     */
    public static class IdentityReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // write (key, val) for every value
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }
}
These instructions assume you have already (1) configured your Hadoop cluster, (2) formatted HDFS, and (3) started the cluster.
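For reference, on a typical installation rooted at /usr/local/hadoop (the same path used in the command below), the last two of those steps usually look something like the following; the exact scripts and paths may differ on your setup:
# /usr/local/hadoop/bin/hdfs namenode -format
# /usr/local/hadoop/sbin/start-dfs.sh
# /usr/local/hadoop/sbin/start-yarn.sh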
The trivial job above and a Makefile to compile it are provided in your starter repository in the trivial directory. Build the program by running make in that directory. Doing so should produce a file called build.jar, which is your compiled Hadoop job. Now, you can launch your Hadoop job (from your master) by running the following:
# /usr/local/hadoop/bin/hadoop jar build.jar Trivial /input-file /output-dir
For this to work, you must have already copied /input-file to HDFS (remember, input and output paths are HDFS paths). You can use any file as the input file (e.g., Trivial.java). Hadoop will report the progress of your job on standard output. Once the job finishes, the output will be available under /output-dir inside HDFS. If you view the resulting part-r-00000 file, you should see the contents of your original input file, with each line preceded by its byte offset (the LongWritable key) and a tab, since the job writes out both the key and the value.
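For example, to stage a copy of Trivial.java as the input before launching the job, and to inspect the result after it completes (the HDFS paths here just mirror the ones above):
# /usr/local/hadoop/bin/hadoop fs -put Trivial.java /input-file
# /usr/local/hadoop/bin/hadoop fs -cat /output-dir/part-r-00000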
Here is some example output from a run of this job:
19/03/28 07:34:19 INFO mapreduce.Job: map 77% reduce 0%
19/03/28 07:34:20 INFO mapreduce.Job: map 93% reduce 0%
19/03/28 07:34:21 INFO mapreduce.Job: map 97% reduce 0%
19/03/28 07:34:24 INFO mapreduce.Job: map 100% reduce 0%
19/03/28 07:34:25 INFO mapreduce.Job: map 100% reduce 100%
19/03/28 07:34:25 INFO mapreduce.Job: Job job_1553754551422_0004 completed successfully
19/03/28 07:34:25 INFO mapreduce.Job: Counters: 52
    File System Counters
        FILE: Number of bytes read=115852
        FILE: Number of bytes written=6363395
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=95160
        HDFS: Number of bytes written=102975
        HDFS: Number of read operations=93
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Failed map tasks=2
        Killed map tasks=1
        Launched map tasks=33
        Launched reduce tasks=1
        Other local map tasks=2
        Data-local map tasks=31
        Total time spent by all maps in occupied slots (ms)=868458
        Total time spent by all reduces in occupied slots (ms)=41226
        Total time spent by all map tasks (ms)=434229
        Total time spent by all reduce tasks (ms)=20613
        Total vcore-milliseconds taken by all map tasks=434229
        Total vcore-milliseconds taken by all reduce tasks=20613
        Total megabyte-milliseconds taken by all map tasks=111162624
        Total megabyte-milliseconds taken by all reduce tasks=5276928
    Map-Reduce Framework
        Map input records=2423
        Map output records=2423
        Map output bytes=110993
        Map output materialized bytes=116026
        Input split bytes=3384
        Combine input records=0
        Combine output records=0
        Reduce input groups=1850
        Reduce shuffle bytes=116026
        Reduce input records=2423
        Reduce output records=2423
        Spilled Records=4846
        Shuffled Maps =30
        Failed Shuffles=0
        Merged Map outputs=30
        GC time elapsed (ms)=8832
        CPU time spent (ms)=18280
        Physical memory (bytes) snapshot=7933607936
        Virtual memory (bytes) snapshot=67026501632
        Total committed heap usage (bytes)=5230821376
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=91776
    File Output Format Counters
        Bytes Written=102975
Once you have gotten this working and are comfortable compiling and running a Hadoop job, you are ready to begin writing your own jobs for Project 3.
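One configuration detail to keep in mind as you do: the Trivial job gets away without declaring its output key/value types because LongWritable and Text happen to match Hadoop's defaults. If your own Mapper or Reducer emits different types, you will generally need to declare them in run() as well. A minimal sketch of the extra calls, assuming a hypothetical job that emits Text keys and IntWritable values:
// inside run(), after Job.getInstance(...), for a hypothetical job emitting (Text, IntWritable)
job.setMapOutputKeyClass(Text.class);           // key type the Mapper emits
job.setMapOutputValueClass(IntWritable.class);  // value type the Mapper emits
job.setOutputKeyClass(Text.class);              // key type the Reducer emits
job.setOutputValueClass(IntWritable.class);     // value type the Reducer emits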