CSCI 3325
Distributed Systems

Bowdoin College
Spring 2022
Instructor: Sean Barker

Project 3 - MapReduce

Assigned: Wednesday, March 30.
Groups Due: Friday, April 1, 11:59 pm.
Code Due Date: Monday, April 11, 11:59 pm.
Writeup Due Date: 48 hours after code due.

In this project, you will perform some 'big data' analyses using MapReduce, one of the most well-known and influential models for large-scale data analysis on machine clusters. MapReduce was originally developed by Google and was later followed by a publicly available system called Hadoop (developed by Apache) that implements the same processing model. You will write and run several of your own MapReduce jobs on a live Hadoop cluster, and you will also be responsible for configuring the cluster itself using "cloud" machines provisioned through Amazon.

This project should be done in teams of two or three (unless otherwise cleared with me). All team members are expected to work on all parts of the project.

Part 1: Cluster Setup

Each group has been provided with a set of virtual servers hosted in an Amazon data center. Your first task is to configure your servers as a small Hadoop cluster. Follow the steps given here to configure your cluster. Note that setting up your cluster from scratch may take some time, depending on your Unix comfort level.

Once that's done, make sure you can compile and run the sample Hadoop job, which gives you the skeleton of a complete Hadoop program without too much complexity.

Once your cluster is running and you are able to execute MapReduce jobs, your task is to write two such jobs (of which the first is basically a warmup).

Part 2: Inverted Index

An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. For example, given the following two documents:

      Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
      Doc2: Buffalo are mammals. 

We could construct the following inverted file index:

      Buffalo -> Doc1, Doc2
      buffalo -> Doc1
      buffalo. -> Doc1
      are -> Doc2
      mammals. -> Doc2 

Your goal is to build an inverted index of words to the documents which contain them. You can try this on the files in the dataset located in the gutenberg directory of your Git repository. You will need to copy these files to your cluster.

Your end result should be something of the form: (word, docid[]). It's perfectly fine to just output each "array" as a single Hadoop Text object containing the document names separated by a comma and space.

The actual logic of this job is quite straightforward, so your primary task will be learning your way around the basic Hadoop classes. For example, you will need to use the InputSplit provided to the mapper (which, by default, is a FileSplit object) to get the filename associated with the map invocation; a sketch of this appears below. Expect to spend some time reading Javadoc as you get acquainted with the MapReduce classes. Apache also has an official MapReduce tutorial that you may find useful. If you reference any other web sources for Hadoop questions (e.g., Stack Overflow), make sure that you are only using org.apache.hadoop.mapreduce classes; avoid anything that uses org.apache.hadoop.mapred classes. The latter package has many similar classes but is part of an old, deprecated Hadoop API, so you should not include anything from it.
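
For instance, a mapper along the following lines (a minimal sketch, not a complete solution; the class name and whitespace tokenization are placeholders) recovers the filename from the FileSplit and emits one (word, filename) pair per token. A corresponding reducer would then deduplicate the filenames for each word and join them with ", ":

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of an inverted index mapper: the FileSplit cast is the standard way to
// recover the filename for the current map invocation.
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The InputSplit handed to this mapper is a FileSplit by default.
        String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), new Text(filename));
            }
        }
    }
}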

Part 3: Internet Advertising

Suppose you work for an internet advertising company and want to better target your ads to users based on prior data. In other words, given an advertising context, you would like to predict which of the available advertisements is most likely to result in a click.

The ad serving machines produce two types of log files: impression logs and click logs. Each time an advertisement is displayed to a customer, an entry is added to the impression log. Each time a customer clicks on an advertisement, an entry is added to the click log. Note that each impression has the potential to result in a click, but only a fraction of impressions will actually do so (so there will be more impressions than clicks).

The advertising company wants to determine which ads should be shown on which pages in order to increase the number of clicks per impression. In particular, given a page URL on which an ad will be shown (this page is called the referrer) and a particular ad ID, your job is to determine the click through rate, which is the percentage of impressions with the specified referrer and ad ID that were clicked. Clearly, a higher click through rate suggests that a particular ad is better suited to a particular page.

More specifically, you will be generating a summary table of click through rates, which could later be queried by ad serving machines to determine the best ad to display. Logically, this table is like a matrix in which one axis represents referrers (pages that have shown ads), the other axis represents ads shown on the referrer pages, and the matrix values represent the click through rates for each (referrer, ad) pair. Note that since each ad will only have been shown on a subset of referrers, most entries of this matrix will be zero (making it what's called a sparse matrix).

Test Logs

Your Git repository contains a test dataset consisting of impression logs (in the impressions directory) and click logs (in the clicks directory) from the ad serving machines. Take a look at a few of the logfiles to get a sense of their content. The logfiles are stored in JSON format, with one JSON object per line. In particular, note that every impression is identified by an impressionId, and each click is similarly associated with an impression ID indicating which ad was clicked. You can try grepping for a particular impression ID in both the impression and click logs to see both matching entries.

Now, copy the log directories into HDFS. As usual, you can ignore any InterruptedException warnings during the copy:

19/03/28 19:26:45 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
  at java.lang.Object.wait(Native Method)
  at java.lang.Thread.join(Thread.java:1252)
  at java.lang.Thread.join(Thread.java:1326)
  at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
  at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
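
For reference, the copy itself can be done with Hadoop's filesystem shell, for example as follows; the local directory names and HDFS destination paths here are just examples, so adjust them to match your own layout:

/usr/local/hadoop/bin/hadoop fs -put impressions /impressions
/usr/local/hadoop/bin/hadoop fs -put clicks /clicks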

Before you can operate on the logfiles directly, a bit of preprocessing is in order. MapReduce operates more efficiently on a small number of large files, but here we have a large number of small files instead. Thus, we'll first use MapReduce to merge all of the files together (for impressions and clicks, respectively). Rather than writing a custom job for this, we can just use the "sort" example job included with Hadoop, as in the following (note that you may need to modify the HDFS paths depending on where you copied the files):

/usr/local/hadoop/bin/hadoop jar \
      /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar \
      sort /impressions /impressions_merged \
      -inFormat org.apache.hadoop.mapreduce.lib.input.TextInputFormat \
      -outFormat org.apache.hadoop.mapreduce.lib.output.TextOutputFormat \
      -outKey org.apache.hadoop.io.LongWritable \
      -outValue org.apache.hadoop.io.Text

Expect this job to take roughly 30 minutes to complete on your cluster. Repeat this step to merge the click logs (which should take another 30 minutes or so).

Take a look at the output of the merge jobs. Each line of the merged files begins with a number (which wasn't part of the original files) and then the data line itself (a JSON object). This difference is due to the Hadoop sort outputting numeric keys with the actual file data as the values. One straightforward way to handle the merged files as input to your job is to write your map input as LongWritable, Text and just ignore the numeric key.
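
In code, that looks something like the following sketch (the class name is a placeholder; the numeric key is simply never used):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: the key is the line number added by the sort job and is ignored;
// the value is one of the original JSON log lines.
public class ImpressionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable sortKey, Text value, Context context)
            throws IOException, InterruptedException {
        String jsonLine = value.toString();
        // ... parse jsonLine and emit whatever key/value pair your design calls for
    }
}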

ClickRate Job

Your custom MapReduce job should be named ClickRate.java and should operate on the merged data files. The job should be provided three arguments: (1) the merged impression files, (2) the merged click files, and (3) the output path. In other words, your job should be executed as follows:

/usr/local/hadoop/bin/hadoop jar build.jar ClickRate [impressions_merged] [clicks_merged] [out]

The output of your job must be in the following exact format:

[referrer, ad_id] click_rate

For example, a line of your output might look like the following, indicating that 10% of the impressions of this ad from this referrer resulted in a click:

[example.com, 01P39XxSg9eU1nmtNxmO028LPd2qz7]	0.1

Each line of your output essentially represents a single entry of the sparse matrix described previously. You do not need to output the zero entries corresponding to ads that were never shown on particular referrers (but many of your click through rates may still be zero if the ad never resulted in a click from that referrer).
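
To make the formatting concrete: the default TextOutputFormat writes each key and value separated by a tab, so a final reducer can emit the bracketed (referrer, ad ID) string as its key and the rate as a DoubleWritable value. The sketch below assumes one hypothetical design in which the incoming key is already that bracketed string and each value marks whether an impression was clicked; your own design may well differ:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical final reducer: assumes the incoming key is already the
// "[referrer, ad_id]" string and each value is "1" if that impression was clicked,
// "0" otherwise. The point here is only the output formatting.
public class ClickRateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long impressions = 0;
        long clicks = 0;
        for (Text v : values) {
            impressions++;
            if (v.toString().equals("1")) {
                clicks++;
            }
        }
        double rate = (impressions == 0) ? 0.0 : (double) clicks / impressions;
        // With TextOutputFormat, this writes "key<TAB>value", matching the required format.
        context.write(key, new DoubleWritable(rate));
    }
}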

When processing the data in your job, you'll need to parse the JSON objects. The easiest way to do so is using a JSON library such as JSON.simple. You can tell Hadoop to include libraries (stored in jarfiles) by adding the jarfiles to a folder named lib and including this folder in the jar you create. Here is a Makefile that will do this:

LIBS=/usr/local/hadoop/share/hadoop
NEW_CLASSPATH=lib/*:${LIBS}/mapreduce/*:${LIBS}/common/*:${LIBS}/common/lib/*:${CLASSPATH}

SRC = $(wildcard *.java) 

all: build

build: ${SRC}
  ${JAVA_HOME}/bin/javac -Xlint -cp ${NEW_CLASSPATH} ${SRC}
  ${JAVA_HOME}/bin/jar cvf build.jar *.class lib

Note: you must use tabs on the last two lines of the Makefile above, not spaces! If you copy-paste the above, you will probably have to delete the spaces and add tabs instead.

Hadoop will distribute the lib folder to all nodes in the cluster and automatically include it in the classpath. You can download the JSON.simple jarfile here.
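
Once the JSON.simple jar is in your lib folder, parsing a log line only takes a few lines of code. Here is a sketch of a small helper (the class and method names are placeholders); impressionId appears in the logs as described above, but double-check the logfiles for the exact names of any other fields you need:

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Sketch of a small helper for pulling one field out of a JSON log line.
// Check the logfiles themselves for the exact field names your job needs
// (impressionId is one of them).
public class JsonUtil {
    public static String getField(String jsonLine, String fieldName) {
        try {
            JSONObject entry = (JSONObject) new JSONParser().parse(jsonLine);
            Object fieldValue = entry.get(fieldName);
            return (fieldValue == null) ? null : fieldValue.toString();
        } catch (ParseException e) {
            return null;  // treat malformed lines as missing data
        }
    }
}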

Sketching out your job on paper is almost certainly a better approach than trying to develop it "on-the-fly" while writing your code. While this job involves more complexity than the inverted index, the amount of actual code needed is still quite modest. Your code should compile without warnings, aside from several "bad path element" messages that you can ignore.

Finally, an important tip: you may find it simpler to design your job as a sequence of two separate MapReduce operations. You can easily execute a sequence of MapReduce operations within your job by constructing multiple Job objects within your run method. The output of one operation can then be fed as the input to the next simply by using the same intermediate path as the first job's output and the second job's input, as in the sketch below.
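
Here is a rough sketch of what such a driver might look like, assuming a hypothetical two-stage design (first joining clicks to impressions, then computing rates); the stage names, intermediate path, and elided configuration calls are all placeholders for your own choices:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a two-stage driver: the second job reads the first job's output
// because both use the same intermediate path.
public class ClickRate extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // args[0] = merged impressions, args[1] = merged clicks, args[2] = output path
        Path intermediate = new Path(args[2] + "_tmp");

        Job joinJob = Job.getInstance(getConf(), "join clicks to impressions");
        joinJob.setJarByClass(ClickRate.class);
        // ... set mapper, reducer, and key/value classes for the first stage ...
        FileInputFormat.addInputPath(joinJob, new Path(args[0]));
        FileInputFormat.addInputPath(joinJob, new Path(args[1]));
        FileOutputFormat.setOutputPath(joinJob, intermediate);
        if (!joinJob.waitForCompletion(true)) {
            return 1;
        }

        Job rateJob = Job.getInstance(getConf(), "compute click rates");
        rateJob.setJarByClass(ClickRate.class);
        // ... set mapper, reducer, and key/value classes for the second stage ...
        FileInputFormat.addInputPath(rateJob, intermediate);   // stage 1 output feeds stage 2
        FileOutputFormat.setOutputPath(rateJob, new Path(args[2]));
        return rateJob.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ClickRate(), args));
    }
}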

Writeup and Submission

Your writeup can be relatively modest for this assignment -- you should discuss the design of your MapReduce jobs, as well as give an overview of how you implemented them in Hadoop. You should also provide a piece of your final output (i.e., part of the final click through rate data). Remember to ensure that your output is in the format specified above!

Submit your assignment via your GitHub repository as usual, which should contain your source code, your writeup (as a PDF), and a snippet of your final output (either in your writeup or in a separate file).

Resources