CSCI 3310
Operating Systems

Bowdoin College
Spring 2018
Instructor: Sean Barker

Project 1 - Inverted Index

Assigned: Monday, January 22
Due Date: Wednesday, February 7, 11:59 pm
Collaboration Policy: Level 1
Group Policy: Individual

This project will give you a quick introduction (or refresher) to C++ programming as well as acquaint you with the course infrastructure. While not intrinsically related to operating systems, this project should ensure that you are sufficiently up to speed with C++ to handle the rest of the course.

For this assignment, you will write a program in C++ that generates an inverted index of all the words in a list of text files. Briefly, an inverted index is a data structure that maps content to the location(s) of that content. In this context, this simply means a map of words (e.g., 'hello') to the documents containing those words (e.g., 'doc1.txt' and 'doc3.txt').

A typical use case for an inverted index is a search engine - you enter a keyword, and an inverted index could be used to produce all the pages that contain that keyword.

Program Specification

Your inverter will run on the command-line and will take exactly one command-line argument: a file that contains a list of filenames (one filename per line). Each of the files named in the input file will contain text that you will use to build your index.

For example, you might have a file named inputs.txt with the following content:

foo1.txt
foo2.txt

Separately, you might have a file named foo1.txt containing the following:

this is a test. cool.

and a file named foo2.txt containing the following:

this is also a test.
boring.

Assuming your compiled inverter is named inverter, you could call your inverter via the following:

./inverter inputs.txt

When run, your inverter should print all of the words from all of the inputs, in alphabetical order, followed by the document numbers in which they appear, in order. For instance, in the example shown above, foo1.txt is document 0, foo2.txt is document 1, and the correct output would be the following:

a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1

Alphabetical is defined as the order according to ASCII and is case sensitive. So, for example, "The" and "the" are separate words, and "The" comes first. Words may only contain alpha characters and not numbers, spaces, or other special characters. Non-alpha characters should simply split a word into multiple words: for example, "Th3e" is two words, "Th" and "e" (note that the non-alpha character is not part of either word).
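
As a rough illustration of the splitting rule above, here is a minimal sketch that breaks a string into maximal runs of alphabetic characters. The helper name splitWords is hypothetical and not part of the specification, and this is only one of several reasonable ways to structure the logic.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not required by the assignment): returns the words
// in `text`, where a word is a maximal run of alphabetic characters.
std::vector<std::string> splitWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        // Cast to unsigned char first: passing a negative value to
        // isalpha is undefined behavior.
        if (std::isalpha(static_cast<unsigned char>(c))) {
            current += c;
        } else if (!current.empty()) {
            // A non-alpha character ends the current word and is discarded.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // Word that runs to the end of the text.
    }
    return words;
}
```

With this sketch, splitWords("Th3e") yields the two words "Th" and "e", and an empty string yields no words at all.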

Files should be numbered incrementally, starting with 0. Only valid, openable files should be included in the count. If the same filename appears multiple times in the input file, you should process those files as normal (i.e., the same as if they had different filenames).

Your output must follow this specification exactly, and your program should produce absolutely no other output. Extraneous output, or output formatted incorrectly (even a single extra space at the end of a line, for instance), will make the autograder mark your solution as incorrect. Your program's behavior even under unusual circumstances (e.g., an input file containing filenames that do not exist) should not differ in any way from what is described here. If you have questions about the specification, please ask!

Read the above paragraph a second time just to make sure you understand it. Programming to a specification requires careful attention to detail!

Lastly, you must name your source file inverter.cc (this is the filename expected by the autograder).

C++ Warmup

You should already have some familiarity with C, but may never have programmed in C++ before. C++ is (very nearly) a superset of C that adds many features -- most notably, full support for objects. C++ also adds a library of built-in data structures called the Standard Template Library (STL), which should feel familiar from Java. Another significant difference is that C++ has a real string class (as opposed to C, which just has char*). Finally, C++ also has lots of fancy features that we won't be using. Unlike C, which is a compact language, C++ is a rather sprawling one.

To get you up to speed with C++, I suggest you reference this UMass Intro to C++ guide. It is largely targeted as an intro to C++ for Java coders, but your existing C knowledge will help quite a bit.

If you also want a refresher on C fundamentals (pointers, etc.), the entire guide should be useful. If you are comfortable with C and just want a quick tutorial on core C++ features, you should focus on Sections 10, 13, and 14.

In all cases, you are likely to want to consult other online references in the course of writing this program (e.g., looking up library functions, C++ syntax, etc). Take the time at this early stage to actually fill in gaps in your C++ knowledge as opposed to simply trying to 'make it work'. It will be worth it later on!

Lastly, note that there are several historical C++ standards, with many new features and changes over the years. The version of C++ we will use is C++11 (the 2011 standard). You should not use any C++ features that were not present in C++11.

Implementation Tips

You are strongly encouraged to develop your program on the autograder server (which is the same machine that will be used to evaluate your program). If you are rusty working with the command line and need a refresher, you should first run through the Unix Crash Course.

The g++ compiler defaults to an older standard, and therefore (by default) will complain if you try to use any C++11 features. To tell g++ to use C++11, you will need to pass the -std=c++11 command-line flag. Of course, the easiest way to include this flag without repeatedly typing the compilation command is to use a Makefile.

Implement the inverted index structure using the C++ Standard Template Library (STL) as a map of sets, as in:

map<string, set<int> > invertedIndex;

Fun fact: unlike Java's commonly used HashMap and HashSet, STL maps and sets are ordered and are typically implemented as balanced binary search trees. Maps based on hash tables were not added to the standard library until C++11, via the unordered_map class.

You should also use C++ strings and file streams to read the files:

#include <string>
#include <fstream>

Make sure that your project uses an ifstream instead of an fstream to avoid accidentally modifying the files. Both are declared in the <fstream> header.
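
As one possible pattern, the sketch below reads a file line by line through an ifstream and reports failure through its return value instead of printing anything. The helper name readLines is hypothetical and not something the assignment requires.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (for illustration only): appends each line of the
// file at `path` to `lines`. Returns false, printing nothing, if the file
// cannot be opened.
bool readLines(const std::string& path, std::vector<std::string>& lines) {
    std::ifstream in(path);  // ifstream opens the file for reading only.
    if (!in.is_open()) {
        return false;
    }
    std::string line;
    // getline returns the stream, which converts to false at end of file,
    // so an empty file simply produces zero iterations.
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return true;
}
```

The stream closes the file automatically when `in` goes out of scope, so no explicit close call is needed here.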

The isalpha function (declared in the <cctype> header) is useful in checking whether a character is alphabetic.

Remember that your program needs to be robust to errors. Files may be empty, may not exist, and so on. Please handle these cases gracefully and with no extra output.

Logistics

Your project will be handed in using the autograding system. Please see the autograder tutorial for submission instructions. Remember to verify before submitting that the code you've written compiles and runs on the autograder server!

Evaluation

Your project will be graded on program correctness, design, and style. Remember that the autograder will only check the correctness of your program, nothing else! Please ask if you have any questions about what constitutes good program design and/or style.