CS 377: Operating Systems - Project 2

Due: Friday, March 14 @ 8 pm EDT

UMass Computer Science, Spring 2014

Project Overview

In this project, you will use Java threads and semaphores to implement a useful application based on the Producer-Consumer problem studied in class.

MapReduce is a popular data processing framework used in cloud computing today. MapReduce is used to process large datasets by companies such as Google, Yahoo, and many others. MapReduce processing consists of Mapper tasks and Reducer tasks that coordinate/split the processing of data between them.

In this assignment, you will implement a toy multi-threaded version of MapReduce (note: you do not need to understand how the actual MapReduce framework works to complete this assignment; just follow the instructions here).

In particular, you will write a parallel application to construct an inverted index on all words in a set of files. An inverted index is simply a hash table that maps each word to the set of files containing that word, together with the line(s) on which the word occurs in each file (e.g., as in a search engine).

Specification

Your program will create k Mapper threads (each corresponding to a single input file) and n Reducer threads responsible for building the inverted index. Your primary class file should be named Index.java and should be invoked as follows:

java Index [n] [file1] [file2] ... [filek]

For example:

java Index 5 foo.txt bar.txt baz.txt
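
As a rough illustration of the command line above, the sketch below separates n from the list of input files. The class name IndexArgs and its fields are hypothetical and not part of the handout; your Index.java may organize this differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the handout): splits the command line
// "java Index [n] [file1] ... [filek]" into n and the k file names.
final class IndexArgs {
    final int reducers;        // n: number of Reducer threads
    final List<String> files;  // file1..filek: one Mapper thread per file

    IndexArgs(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "usage: java Index [n] [file1] [file2] ... [filek]");
        }
        reducers = Integer.parseInt(args[0]);
        if (reducers < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // Everything after the first argument is an input file.
        files = Arrays.asList(args).subList(1, args.length);
    }
}
```

With the example invocation, reducers would be 5 and files would hold the three file names, so three Mapper threads would be created.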

The output of your program will be the resulting inverted index: for each word that appears somewhere in one of the files, a single line is printed that specifies which files the word appears in and at which line(s) in those files. The output should use the following format:

word1 file@line,line,... file@line,line,... ...
word2 file@line,line,... file@line,line,... ...

For example, if foo.txt contains "cat" on line 1 and "dog dog" on line 2, and bar.txt contains "dog" on line 1 and "sheep dog" on line 2, then the output index should be:

cat foo.txt@1
dog bar.txt@1,2 foo.txt@2,2
sheep bar.txt@2

The word entries (i.e., lines) in the output index should be sorted alphabetically (e.g., cat before dog), as should the filenames within each line (e.g., bar.txt before foo.txt). Note that a line number within a given file may be repeated for the same word, as in "dog dog" above (since the word dog appears twice on line 2 of foo.txt).
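
One possible way to produce this format is sketched below. It assumes the index is held as a map from word to (file name to list of line numbers); that representation and the names IndexFormat, formatEntry, and formatIndex are illustrative assumptions, not requirements of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical formatter for the output format described above.
final class IndexFormat {
    // Formats one entry, e.g. "dog bar.txt@1,2 foo.txt@2,2".
    static String formatEntry(String word, Map<String, List<Integer>> postings) {
        StringBuilder sb = new StringBuilder(word);
        // Copying into a TreeMap orders the file names alphabetically.
        Map<String, List<Integer>> sorted = new TreeMap<>(postings);
        for (Map.Entry<String, List<Integer>> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey()).append('@');
            List<Integer> lines = e.getValue();
            // Line numbers are printed in stored order; repeats are kept.
            for (int i = 0; i < lines.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(lines.get(i));
            }
        }
        return sb.toString();
    }

    // Formats the whole index, one line per word, words in alphabetical order.
    static List<String> formatIndex(Map<String, Map<String, List<Integer>>> index) {
        Map<String, Map<String, List<Integer>>> sorted = new TreeMap<>(index);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, List<Integer>>> e : sorted.entrySet()) {
            out.add(formatEntry(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

The TreeMap copies are only one way to get the required ordering; sorting the keys explicitly before printing would work equally well.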

Important: Follow the above program and output format specification exactly! We will be running a program to test that your index is correct! We will provide a sample set of input files and (correct) output so you can verify that your output is correct.

Implementation

In the above example, three Mapper threads and five Reducer threads will be created.

Each Map thread reads one of the text files. Thus, Map thread 1 will read the words in file foo.txt, Map thread 2 will read the words in bar.txt, and so on. Ignore all non-alphanumeric characters in the input files and convert all words to lowercase - you can do this on a line of text by calling

str.replaceAll("[^A-Za-z0-9 ]", "").toLowerCase()
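
To show how this call might fit into a Map thread, here is a minimal sketch that cleans a line and splits it into words. The class and method names (LineWords, wordsOf) are hypothetical. Note that String.replaceAll takes two arguments, the pattern and the replacement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns one line of text into its lowercase words.
final class LineWords {
    static List<String> wordsOf(String line) {
        // Drop everything except letters, digits, and spaces, then lowercase.
        String cleaned = line.replaceAll("[^A-Za-z0-9 ]", "").toLowerCase();
        List<String> words = new ArrayList<>();
        // Split on runs of spaces and skip the empty string that split()
        // yields for blank lines or leading spaces.
        for (String w : cleaned.split(" +")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }
}
```

Because the pattern keeps only spaces as separators, a tab between two words is removed and the words are joined; whether that matters depends on the input files.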

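Within each Map thread, a hash function maps each word to an integer that identifies one of the n Reduce threads. To produce a simple hash function, you can use Java's built-in hash function for Strings (str.hashCode()) and then mod by n to restrict the output range. Note that hashCode() may return a negative number, so make sure the final value is a valid thread index (0 to n-1, or 1 to n if you number your threads from 1). If a word produces a hash value i, it is then "sent" to Reduce thread i for the actual computation of the inverted index. This is done by inserting the word (together with its file name and line number, which the Reduce thread needs to build the index) into the bounded buffer for the corresponding Reduce thread.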
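
A minimal sketch of such a hash function follows. It assumes Reduce threads are numbered 0 to n-1 and that Java 8's Math.floorMod is available; the names WordHash and reducerFor are hypothetical.

```java
// Hypothetical helper: picks the Reduce thread for a word.
final class WordHash {
    // Returns an index in the range 0..n-1.
    static int reducerFor(String word, int n) {
        // String.hashCode() can be negative, and the % operator would then
        // give a negative remainder; floorMod always returns a value in 0..n-1
        // for positive n.
        return Math.floorMod(word.hashCode(), n);
    }
}
```

Since the result depends only on the word, every occurrence of a given word goes to the same Reduce thread, regardless of which Map thread found it.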

Assume that there is one bounded buffer of size 10 for each Reduce thread. Map threads are the Producers and Reduce threads are the Consumers. Map threads insert words into one of the bounded buffers, and Reduce threads consume words from the buffer to compute the inverted index.

Each Reduce thread repeatedly consumes (i.e., reads) the next word from its bounded buffer and adds it to the inverted index. For each word, the inverted index should contain the file name(s) where the word occurred and the line numbers in those files. For example, suppose the word 'pickle' occurs in foo.txt on line 2 and in bar.txt on line 7. In this case, the inverted index should contain the [key: value] entry [pickle: (foo.txt, 2), (bar.txt, 7)]. You may use a simple data structure such as a HashMap to track words in the inverted index. You may either give each Reduce thread its own (thread-local) HashMap in which it computes its part of the inverted index, or share a single map between all threads (in which case the map will be accessed concurrently by all Reduce threads, so you should use a ConcurrentHashMap instead).
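
The sketch below shows one way a Reduce thread could record an occurrence in its own (thread-local) HashMap. It assumes each item taken from the buffer carries the word, file name, and line number; the class InvertedIndex and its methods are hypothetical, and computeIfAbsent requires Java 8.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-thread index: word -> (file name -> line numbers).
// Not thread-safe; intended for use by a single Reduce thread.
final class InvertedIndex {
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    // Records that 'word' occurred in 'file' on line 'line'.
    void add(String word, String file, int line) {
        index.computeIfAbsent(word, w -> new HashMap<>())
             .computeIfAbsent(file, f -> new ArrayList<>())
             .add(line);  // repeats are kept, e.g. "dog dog" adds line 2 twice
    }

    Map<String, Map<String, List<Integer>>> asMap() {
        return index;
    }
}
```

For the 'pickle' example, calling add("pickle", "foo.txt", 2) and add("pickle", "bar.txt", 7) leaves a single entry for pickle with one line list per file.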

You will need to implement your own Bounded Buffer as in Lecture 8. You may use Java monitors OR Java's Semaphore class (java.util.concurrent.Semaphore) for synchronization.
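
For orientation, here is a minimal sketch of a semaphore-based bounded buffer in the classic producer-consumer style. It is not necessarily the version presented in Lecture 8, and the class and method names (BoundedBuffer, put, take) are assumptions.

```java
import java.util.concurrent.Semaphore;

// Hypothetical bounded buffer: a circular array guarded by three semaphores.
final class BoundedBuffer<T> {
    private final Object[] slots;
    private int head;  // index of the next item to take
    private int tail;  // index of the next free slot
    private final Semaphore emptySlots;                 // counts free slots
    private final Semaphore fullSlots = new Semaphore(0); // counts stored items
    private final Semaphore mutex = new Semaphore(1);   // protects head/tail/slots

    BoundedBuffer(int capacity) {
        slots = new Object[capacity];
        emptySlots = new Semaphore(capacity);
    }

    // Blocks while the buffer is full.
    void put(T item) throws InterruptedException {
        emptySlots.acquire();
        mutex.acquireUninterruptibly();
        try {
            slots[tail] = item;
            tail = (tail + 1) % slots.length;
        } finally {
            mutex.release();
        }
        fullSlots.release();
    }

    // Blocks while the buffer is empty.
    @SuppressWarnings("unchecked")
    T take() throws InterruptedException {
        fullSlots.acquire();
        mutex.acquireUninterruptibly();
        T item;
        try {
            item = (T) slots[head];
            slots[head] = null;
            head = (head + 1) % slots.length;
        } finally {
            mutex.release();
        }
        emptySlots.release();
        return item;
    }
}
```

In this assignment each Reduce thread would own one such buffer of capacity 10. Note that take() as written blocks forever on an empty buffer, so a complete solution still needs some way for Reduce threads to learn that no more words will arrive.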

Finally, you will also need to maintain a shared counter variable initialized to k (the number of Map threads). When each Map thread terminates (i.e., is done reading its file), it will decrement the counter by 1. If the counter equals zero, and if all bounded buffers are empty, then the active Reduce thread will print out the (completed) inverted index, using the above formatting specification. Be sure to protect this shared counter using synchronization. If each Reduce thread has its own HashMap (containing a subset of the complete inverted index), you will need to combine all individual HashMaps (e.g., using hashmap.putAll()) before printing.
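
The counter and the final merge could look roughly like the sketch below. The names MapperCounter, mapperDone, allDone, and merge are hypothetical. The merge relies on the fact that each word hashes to exactly one Reduce thread, so the per-thread maps have disjoint keys and putAll() never overwrites an entry.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical shared counter of Map threads that are still running.
final class MapperCounter {
    private int remaining;

    MapperCounter(int mappers) {
        remaining = mappers;
    }

    // Called once by each Map thread when it has finished its file.
    synchronized void mapperDone() {
        remaining--;
    }

    // True once every Map thread has finished.
    synchronized boolean allDone() {
        return remaining == 0;
    }

    // Combines the per-thread partial indexes into one map sorted by word.
    static <V> Map<String, V> merge(List<Map<String, V>> parts) {
        Map<String, V> merged = new TreeMap<>();
        for (Map<String, V> part : parts) {
            merged.putAll(part);
        }
        return merged;
    }
}
```

Both methods that touch the counter are synchronized on the same object, which is one way to satisfy the requirement that the counter be protected; a Semaphore used as a mutex would also work.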

Submission

You may work in groups of 2 for this lab. Like last week, you should submit both your program and a design document specifying how you designed and implemented your program.

All of the following files must be submitted on Moodle as an archive file (e.g., .zip or .tar.gz) to get full credit for this assignment.

Your submission archive should contain a copy of all source files.
Your submission should contain a README file identifying your lab partner (if you have one) and containing an outline of what you did for the assignment. It should also explain and motivate your design choices. Explain the design of your MapReduce program and how its synchronization works. Clearly explain your approach but keep it short and to the point. If your implementation does not work, you should also document the problems in the README, preferably with your explanation of why it does not work and how you would solve it if you had more time. Of course, you should also comment your code. We can't give you credit for something we don't understand!
Finally, your submission should contain a file showing sample output from your program.
(if working in a group) A file called GROUP that lists both members of your group and what contributions they made to the project. Of course, you should strive for an equal distribution of labor and effort. We reserve the right to assign different grades to group members if one member was clearly inadequately participating in completing the project.
Note: We will strictly enforce policies on cheating. Remember that we routinely run similarity checking programs on your solutions to detect cheating. Please make sure you turn in your own work.
You should be very careful about using code snippets you find on the Internet. In general, your code should be your own. It is OK to read tutorials on the web and use these concepts in your assignment. Blind use of code from the web is strictly disallowed. Feel free to check with us if you have questions on this policy. And be sure to document any Internet sources or tutorials you have used to complete the assignment in your README file.

Project 2 Grading scheme

(25) design document
(40) producer/consumer and threading
(15) output correctness
(20) code style (structure and clarity, including comments)

Late Policy: Project 2 is due at 8 PM on Friday, March 14. Please refer to the course syllabus for the late policy on lab assignments. This late policy will be strictly enforced. Please start early so that you can submit the assignment on time.