
CSCI 3325
Distributed Systems

Bowdoin College
Spring 2015
Instructor: Sean Barker

Project 3 - MapReduce

The goal of this project is to become familiar with MapReduce, a popular model for programming distributed systems that was developed by Google and is publicly available in open-source form as Apache Hadoop.

A good starting point is the sample Hadoop job, which gives you the skeleton of a complete Hadoop program without too much complexity.
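
For orientation while you read this handout, here is a minimal word-count-style job in the same spirit (a sketch only -- the class name SampleJob is a placeholder, and the sample job handed out in class remains the authoritative template):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SampleJob {

    // Mapper: turns each input line into (word, 1) pairs.
    public static class SampleMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), ONE);
                }
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class SampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sample job");
        job.setJarByClass(SampleJob.class);
        job.setMapperClass(SampleMapper.class);
        job.setReducerClass(SampleReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run such a job with something like: bin/hadoop jar build.jar SampleJob [input] [output]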

This project should be done in teams of two or three.

Part 0: Cluster Setup

You have all been provided with a set of virtual machines from Amazon EC2 (m1.small instances if you are curious). These machines are running a fairly bare-bones Linux installation on which you will configure a small Hadoop cluster.

You should have already completed the steps needed to set up your cluster.

Part 1: Build an Inverted Index

An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines use some form of inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table that maps words in the documents to some sort of document identifier. For example, given the following two documents:

      Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
      Doc2: Buffalo are mammals. 

We could construct the following inverted file index:

      Buffalo -> Doc1, Doc2
      buffalo -> Doc1
      buffalo. -> Doc1
      are -> Doc2
      mammals. -> Doc2 

Your goal is to build an inverted index mapping words to the documents that contain them. You can try this on the files in the dataset located here. You will need to copy these files to your cluster.

Your end result should be something of the form: (word, docid[]).
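
One possible design (a sketch, not the only correct approach): the mapper emits a (word, filename) pair per token, using the input file's name as the document id, and the reducer deduplicates and joins the ids. With the classes nested inside your job class as usual, and with java.util.TreeSet and org.apache.hadoop.mapreduce.lib.input.FileSplit imported, that might look like:

public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the name of the file this split came from as the document id.
        String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), new Text(docId));
            }
        }
    }
}

public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // A word usually appears many times per document, so deduplicate
        // the ids before emitting the final (word, docid[]) entry.
        TreeSet<String> docs = new TreeSet<String>();
        for (Text v : values) {
            docs.add(v.toString());
        }
        context.write(key, new Text(docs.toString()));
    }
}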

Part 2: Build a Summary Table

Suppose you work for an internet advertising company. You want to better target your ads to users, based on prior data. That is, given an advertising context, we would like to predict which of our available advertisements is most likely to result in a click.

The ad serving machines produce two types of log files: impression logs and click logs. Every time we display an advertisement to a customer, we add an entry to the impression log. Every time a customer clicks on an advertisement, we add an entry to the click log.

For this assignment, we will look at one particular feature: the page url on which the ad will be shown. Given a page url and an ad id, we wish to determine the click-through rate, which is simply the percentage of impressions with that page url and ad id that were clicked.

Your goal is to build a summary table of click-through rates, which could later be queried by the ad serving machines to determine the best ad to display. Logically, this table is a sparse matrix whose axes are page url and ad id, and whose values are the percentage of times each ad was clicked.
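
For example (with hypothetical numbers): if the ad with id ad_17 was shown 200 times on pages with url www.example.com/sports, and 10 of those impressions were clicked, the table would contain the entry:

      (www.example.com/sports, ad_17) -> 10 / 200 = 0.05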

You can do this on the files located in click_through_data.tar.gz. You will need to copy these files to your cluster. This may take a while! To save time, you may skip ahead and use the merged dataset included below. I do recommend, however, that you at least look at the original dataset to understand its format a bit.

Your end result will be something of the form: (page_url, ad_id, clicked_percent).

You'll have to do a little more than simply tokenize the text on whitespace. The log files are stored in JSON format, with one JSON object per line. You should use a JSON library, such as json-simple (a parsing sketch appears below). You can include libraries in Hadoop by adding the library's jar to a folder named lib and including the lib folder in the jar you create. To do this, just change the last line of your Makefile to

${JAVA_HOME}/bin/jar cvf build.jar *.class lib

Hadoop will distribute the lib folder to all nodes in the cluster and automatically include it in the classpath.
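
For reference, parsing one log line with json-simple looks roughly like this (a sketch; "referrer" and "adId" are illustrative field names -- inspect the logs to find the actual keys):

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Inside map(): each input line holds exactly one JSON object.
// (Creating the parser once per mapper, e.g. in setup(), avoids
// re-allocating it for every line.)
JSONParser parser = new JSONParser();
String line = value.toString();
try {
    JSONObject record = (JSONObject) parser.parse(line);
    // "referrer" and "adId" are illustrative names -- check the dataset
    // for the actual field names used in the impression and click logs.
    String referrer = (String) record.get("referrer");
    String adId = (String) record.get("adId");
    // ... emit intermediate pairs based on these fields ...
} catch (ParseException e) {
    return; // skip malformed lines rather than failing the whole job
}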

The ad serving machines write log files every few minutes. Thus, there is a large number of files in the dataset. MapReduce was designed to handle a small number of large files, but we instead have a large number of small files. Once you have copied the dataset into HDFS, you can use the following command to merge all of the files together. Do this for both impressions and clicks.

hadoop jar hadoop-examples-1.2.1.jar sort impressions impressions_merged \
        -inFormat org.apache.hadoop.mapreduce.lib.input.TextInputFormat \
        -outFormat org.apache.hadoop.mapreduce.lib.output.TextOutputFormat \
        -outKey org.apache.hadoop.io.LongWritable \
        -outValue org.apache.hadoop.io.Text

The merged dataset is available at dataset_merged.tar.gz. If you use this dataset, you do not have to run the merge command shown above! Each line of the merged files begins with a "0" followed by the data, whereas the original files begin immediately with the data and no leading 0. This is because 'hadoop sort' outputs 0 as the keys and the data as the values. One way to deal with this is to write the map function as <LongWritable,Text,Text,Text> and take the value, ignoring the key; this way your code will work for both the unmerged and merged files. You do not have to do it this way, but it is one option, sketched below.
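
A sketch of that option (assuming the merged files separate the leading 0 from the data with a tab, which is what TextOutputFormat writes; org.apache.hadoop.io.LongWritable must be imported):

public static class ClickRateMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // With TextInputFormat the key is just a byte offset, so ignore it.
        String line = value.toString();
        // Merged lines look like "0<TAB>{...json...}"; unmerged lines are
        // just the JSON object. Stripping the leading key (if present)
        // makes the same mapper work on both.
        String json = line.startsWith("0\t") ? line.substring(2) : line;
        // ... parse json and emit intermediate pairs ...
    }
}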

IMPORTANT NOTE ABOUT GRADING: I will run the projects in the following way:

bin/hadoop jar build.jar ClickRate [impressions_merged] [clicks_merged] [output]

In addition, the output should be in the following format:

[referrer, ad_id] click_rate

This can be achieved by making the key the string "[referrer, ad_id]", and the value click_rate.
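
For example, in the reducer that writes the final table (referrer, adId, and clickRate here are placeholder variables):

// Emits lines of the form: [referrer, ad_id] <TAB> click_rate
String outKey = "[" + referrer + ", " + adId + "]";
context.write(new Text(outKey), new Text(Double.toString(clickRate)));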

IF YOUR CODE AND OUTPUT DO NOT ADHERE TO THIS FORMAT, YOU WILL BE PENALIZED!

Part 3: Writeup and Submission

Your writeup can be relatively modest for this assignment -- you should discuss the design of your MapReduce jobs, as well as give an overview of how you implemented them in Hadoop. You should also provide a piece of your final output (i.e., part of the final click rate data). Remember to ensure that your output is in the format specified above!

To submit your assignment, submit a gzipped tarball to Blackboard. Please include the following files in your tarball:

  1. Your writeup (PDF).
  2. Your source code files only. Please do not include any executables.
  3. Your Makefile (or compile instructions) and your jar file (build.jar).
  4. A snippet of your final output (either part of or separate from your writeup).

Resources