
CSCI 2310
Operating Systems

Bowdoin College
Fall 2015
Instructor: Sean Barker

Project 1 - Inverted Index

This project will give you a refresher in C++ programming as well as acquaint you with the course infrastructure. The primary goal of this assignment is to ensure that you are sufficiently up to speed in C++ to handle the rest of the course.

For this assignment, you will write a program in C++ that generates an inverted index of all the words in a list of text files.

Inverter Input Specification

Your inverter will take exactly one argument: a file that contains a list of filenames (one filename per line). Each of the files named in the input file will contain text that you will use to build your index.

For example, if you have an input file named inputs.txt that names two files called foo1.txt and foo2.txt, these files could contain the following:

inputs.txt
-----
foo1.txt
foo2.txt

foo1.txt
-----
this is a test. cool.

foo2.txt
-----
this is also a test.
boring.

Inverter Output Specification

Your inverter should print every word from all of the inputs, one word per line in "alphabetical" order, each followed by the numbers of the documents in which it appears, in ascending order. For instance, in the example shown above, foo1.txt is document 0 and foo2.txt is document 1, so the correct output would be the following:

a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1

Alphabetical is defined as the order according to ASCII and is case sensitive. So, for example, "The" and "the" are separate words, and "The" comes first. Words may only contain alpha characters and not numbers, spaces, etc. Non-alpha characters should simply split a word into multiple words: for example, "Th3e" is two words, "Th" and "e".
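The word-splitting rule above can be sketched as follows. This is only one possible approach, not a required design; the helper name splitWords is hypothetical, and the sketch assumes the default "C" locale, in which isalpha accepts only the letters A-Z and a-z.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text into words made only of alphabetic
// characters. Any other character ends the current word and is discarded.
std::vector<std::string> splitWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // Cast to unsigned char first: passing a negative char value
        // to isalpha is undefined behavior.
        if (std::isalpha(static_cast<unsigned char>(text[i]))) {
            current += text[i];
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // flush a word that runs to the end of the text
    }
    return words;
}
```

With this sketch, splitWords("Th3e") yields the two words "Th" and "e". Comparing std::string values with < already gives the case-sensitive ASCII order described above, so "The" sorts before "the" without any extra work.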

Files are incrementally numbered, starting with 0. Only valid, openable files should be included in the count. If the same filename appears multiple times in the input file, you should process those files as normal (i.e., the same as if they had different filenames).
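As an illustration of the numbering rule, here is a minimal sketch that assigns document numbers only to files that can actually be opened. The helper name numberDocuments is hypothetical, and a real solution would more likely process each file as it is opened rather than collect the names first.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: return, in order, the filenames listed in listPath
// that could be opened. A name's position in the result is its document
// number, so unopenable files never consume a number.
std::vector<std::string> numberDocuments(const std::string& listPath) {
    std::vector<std::string> docs;
    std::ifstream list(listPath.c_str());
    std::string name;
    // If the list itself cannot be opened, getline fails at once and the
    // result is simply empty.
    while (std::getline(list, name)) {
        std::ifstream doc(name.c_str());
        if (doc.is_open()) {
            docs.push_back(name);  // repeated filenames get a new number each time
        }
    }
    return docs;
}
```

For a list naming foo1.txt, a nonexistent file, and foo2.txt, this sketch numbers foo1.txt as 0 and foo2.txt as 1.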

Your output must follow this specification exactly, and your program must not produce any other output. Extraneous output, or incorrectly formatted output (even a single extra space at the end of a line, for instance), will make the autograder mark your solution as incorrect. Your program's behavior even under unusual circumstances (e.g., an input file containing filenames that do not exist) should not differ in any way from what is described here. If you have questions about the specification, please ask!

Read the above paragraph a second time just to make sure you understand it. Programming to a specification requires careful attention to detail!

Implementation Tips

You should name your source file inverter.cc (this is the filename expected by the autograder).

You are welcome to use any Unix/Linux-based machine to write your programs. In addition to the autograder server, Bowdoin has two public Linux systems with all necessary software preinstalled that you may use - dover.bowdoin.edu and foxcroft.bowdoin.edu. You can SSH into any of these machines from on campus in order to write your programs. If you are rusty or unfamiliar with the command line, you should run through the Unix Crash Course. More information on the Bowdoin Linux environment is available here.

If you are rusty with C++, I suggest you first read the UMass Intro to C++ guide. It is partially targeted at those who are familiar with Java, but is also a very good (and compact) refresher on C++ in general.

Implement the inverted index structure using the C++ Standard Template Library (STL) as a map of sets, as in:

map<string, set<int> > invertedIndex;
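A minimal sketch of how such a map of sets might be filled and turned into the required output format is shown below. The names InvertedIndex, addWord, and formatIndex are hypothetical and only illustrate one way to organize the code.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

typedef std::map<std::string, std::set<int> > InvertedIndex;

// Record that `word` occurs in document number `doc`.
// The set silently ignores a document number it already holds.
void addWord(InvertedIndex& index, const std::string& word, int doc) {
    index[word].insert(doc);
}

// Build the output text: one "word: n n n" line per entry.
// Each number is preceded by a space, so no line ends with a trailing space.
std::string formatIndex(const InvertedIndex& index) {
    std::ostringstream out;
    for (InvertedIndex::const_iterator entry = index.begin();
         entry != index.end(); ++entry) {
        out << entry->first << ":";
        const std::set<int>& docs = entry->second;
        for (std::set<int>::const_iterator d = docs.begin(); d != docs.end(); ++d) {
            out << " " << *d;
        }
        out << "\n";
    }
    return out.str();
}
```

Both containers iterate in sorted order, so the words come out in ASCII order and the document numbers in ascending order with no explicit sorting step. The result can be written with std::cout << formatIndex(invertedIndex);.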

Use C++ strings and file streams:

#include <string>
#include <fstream>

Make sure that your project uses an ifstream (an input-only file stream) instead of an fstream. Both are declared in the <fstream> header.

The isalpha function, declared in <cctype>, is useful for checking whether a character is alphabetic.

Remember, your program needs to be robust to errors. Files may be empty, etc. Please handle these cases gracefully and with no extra output.
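One way to handle missing and empty files quietly is sketched below. The helper name readFile is hypothetical, and reading a whole file into a string is just one option; reading character by character or line by line works equally well.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read the whole file at `path` into `contents`.
// Returns false, printing nothing, if the file cannot be opened.
// An empty file is not an error: it returns true with empty contents.
bool readFile(const std::string& path, std::string& contents) {
    std::ifstream in(path.c_str());
    if (!in.is_open()) {
        return false;
    }
    contents.assign(std::istreambuf_iterator<char>(in),
                    std::istreambuf_iterator<char>());
    return true;
}
```

In this sketch the caller decides what to do on failure, which keeps all printing in one place and makes it easier to guarantee that nothing extra reaches the output.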

Project Submission

Your project will be handed in using the autograding system. Please see the autograder tutorial for submission instructions. Remember to verify before submitting that the code you've written compiles and runs on the autograder server!

Project Writeup

Unlike future projects, no writeup is required for this project.