Project 1 - Inverted Index

Assigned:	Monday, February 8
Due Date:	Tuesday, February 23, 11:59 pm
Collaboration Policy:	Level 1
Group Policy:	Individual

This warmup project has two primary goals: (1) acquainting you with the tools and infrastructure that we will use throughout the course, and (2) giving you an introduction (or refresher) to C++ programming. Completing this project should ensure that you are sufficiently up to speed with C++ to handle the rest of the projects (where we will dive into proper OS topics)!

1 - Tools and Infrastructure

Slack Setup

Questions, discussions, and announcements in this class will all make use of Slack, which is a channel-based messaging platform. Slack will provide a convenient, efficient way to communicate with me, your classmates, and your team (for group projects).

Slack will be used in preference to all other written forms of communication outside of class, including (and especially) email. Once you have Slack set up, do not send me email; send me a direct message (DM) on Slack instead! Similarly, I will post important announcements and information to Slack instead of sending email. Therefore, it is critical that you are configured to receive Slack notifications so you do not miss important information from me. Follow these steps to get set up:

First, watch this 1-minute video to get a quick overview of how Slack works.
Join the CSCI 3310 Slack by going to this course on Blackboard and clicking on "CSCI 3310 Slack Signup" on the landing page. Sign up using your Bowdoin email. Once you have done this (once), you will be able to access the Slack via the regular "CSCI 3310 Slack" link also posted on the Blackboard landing page.
Download the Slack desktop app. While you can access Slack via the web interface, the desktop app is more seamless and may encourage you to check in more regularly (like a dedicated email client). You can also install the Slack mobile app if desired.
Configure your Slack notifications. As mentioned above, I will use Slack instead of email to make important announcements, so it is critical that you are notified of such announcements. Set your notifications to either "All new messages" or "Direct messages, mentions & keywords". At the bottom of the Notifications preferences, under "When I'm not active on desktop", you may want to check the "Send me email notifications" box, particularly if you might otherwise not be signed into Slack on a regular basis (you can always change these settings later on).
From within Slack, send me a DM so I know that you're set up. I will respond with a code that you will need in your final submission (only at the very end; you don't need to wait for my response before continuing).

The Slack contains four public channels (visible to everyone): general (for announcements and general discussion), inclass (for discussion of material that was covered in-class), projects (for discussion of programming projects), and random (for anything else off-topic). You can also create your own private channels, which is particularly useful for working in groups. At the moment, creation of new public channels is restricted to admins (i.e., me), but we may alter the public channel lineup later on. If you have a good idea for a new public channel, please let me know!

Now that you are set up in Slack, you are ready to proceed with the rest of the project and get help along the way as needed!

Command-Line Environment

This class will make heavy use of the Linux command-line. You should already have some comfort with the command-line, but if you need or want a refresher, you should go through the Command-Line Unix Crash Course.

Version Control

We will also be making use of the git version control system (and GitHub) to facilitate collaboration. You may already have some experience with version control, but not necessarily with git specifically. If you are new to git or just want to brush up, check out the Git Essential Training Companion Guide. You can skip the third part (which primarily deals with group collaboration) until the first group project.

Class Server

You have been provided an account on the class Linux server that will allow you to work on and submit your projects. You are strongly encouraged to do your coding on the class server. The name of the class server is turing.bowdoin.edu (or if you are on-campus, just turing for short). The server provides a full-fledged Linux environment with all the standard development tools (editors, compilers, version control, etc) preinstalled. If there is any software that you would like to use that is not already installed on the server, please let me know. Access to your server account is by SSH key only (no password access). You should have received your keyfiles from me via email (if not, let me know immediately). For instructions on using your key, refer to the relevant section of the Unix tutorial.

Project Autograder

Finally, all projects will make use of an autograding system installed on the class server. Read the Autograder Overview carefully and let me know if you have any questions about the autograder.

General Logistics

Starter code for the projects will be distributed via GitHub. For each project, you (or your group) will have a personal git repository, accessible only by you and me, in which you will complete your work. At the start of each project, an invitation link will be posted to Blackboard that will initialize your personal project repository on GitHub and add any starter files. Once your project repository is initialized, you can clone it to turing and then begin to work. Although you will make your final submission via the autograder, you should also make sure that your final work is committed and pushed to GitHub.

It is recommended to configure your GitHub account to use your SSH key for authentication, which will allow you to interact with GitHub without needing to type your username or password. Refer to the Git tutorial for instructions.

2 - C++ Overview

All projects in this course will be written in C++. You should already have some familiarity with C, but may not have ever programmed in C++ before. C++ is very nearly a superset of C that adds many features -- most significantly, full support for objects and a library of built-in data structures called the Standard Template Library (STL). Another notable difference is that C++ has a real string class (as opposed to C, which just has char*). C++ has lots (lots!) of features and capabilities that we won't be using at all; modern C++ is a sprawling and complex language (unlike C, which is quite compact). To help with this complexity, we will only be focusing on a few bits and pieces of C++, and otherwise most of the code we write should feel familiar coming from a C background.

Compilation

While the standard C compiler you are familiar with is called gcc, the standard C++ compiler is called g++. Note that there are several historical C++ standards, which have added many new features over the years. The version of C++ that we will use is C++11 (the 2011 standard). You should not use any C++ features that were added in later versions (e.g., C++14 or C++17). To tell g++ to use C++11, you need to pass the -std=c++11 flag. You should also always pass the -Wall flag, which turns on most of the commonly useful compiler warnings (despite the name, not literally all of them). For example, to compile a source file named inverter.cc into an executable named inverter:

g++ -std=c++11 -Wall -o inverter inverter.cc

Language Primer

As a first step, take a look at this Intro to C++ guide. This guide is targeted towards beginners who know Java but not C, so your existing C knowledge will help quite a bit. If you want a refresher on C fundamentals (pointers, etc), the entire guide should be useful. If you are comfortable with C and just want a quick tutorial on core C++ features, you should focus on Section 10 (classes are new in C++), Section 13 (basic input and output in C++), and Section 14 (the STL). Between your existing knowledge of C and Java, you should be able to understand these examples fairly readily.

Reference Variables

C++ has reference variables, which are aliases to some other variable. Section 8 of the above guide contains one example of using reference variables, but they deserve special attention on account of their frequent use as function parameters.

Reference variables are used when you want to pass a reference to a value/struct/object (called pass-by-reference) rather than the standard behavior of passing a copy of the value/struct/object (called pass-by-value). Passing a reference is more efficient (since nothing is copied) and also allows you to modify the original argument within the called function. Achieving this in a plain C program requires passing a pointer (which is still pass-by-value, but you're only copying a single pointer value), whereas in C++ we most often use a reference variable instead of a pointer. Here is a quick example of pass-by-value using a pointer vs. pass-by-reference; the pointer example is nearly identical to a C program (apart from the use of cout rather than printf), whereas the reference example has no analog in C. Here is a more technical explanation of reference variables, which contains more information on syntax and more detailed examples.

In general, pointers tend to be used less often in C++ than in C as a consequence of having reference variables (which are safer and more convenient in most cases). You can also mark a reference const, which prevents the referenced object from being modified through that reference. A pointer-to-const offers similar protection, but unlike a reference, a pointer can still be null or be reassigned to point somewhere else.
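
As a rough illustration of the difference, the sketch below writes the same operation once with a pointer parameter and once with a reference parameter, plus a const reference that can read but not modify its argument. The function names are made up for this example and do not come from any course material.

```cpp
#include <cstddef>
#include <string>

// C-style: the caller passes an address and the function dereferences it.
void appendBangPtr(std::string* s) {
  *s += "!";
}

// C++-style: s is an alias for the caller's variable, so no dereference
// is needed and the caller does not write '&' at the call site.
void appendBangRef(std::string& s) {
  s += "!";
}

// A const reference avoids copying the string but forbids modifying it.
std::size_t lengthOf(const std::string& s) {
  // s += "!";  // would not compile: s refers to a const std::string
  return s.size();
}
```

Given a variable std::string word, the pointer version is called as appendBangPtr(&word), while the reference version is simply appendBangRef(word). Both change the caller's word.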

C++11 Features

The above C++ guide predates the C++11 standard, so it does not use any more modern language features. C++11 introduced a pile of new features, most of which we won't need. Two particular additions worth noting, however, are foreach-style loops (formally called range-based for loops) and the auto type keyword. Foreach-style loops use the same syntax as in Java; e.g., if arr is an array of integers, you could write the following to print the array:

for (int i : arr) {
  cout << i << " ";
}

The auto keyword can be used as a type when you want the compiler to infer the type from the rest of the program. For example, you could rewrite the above loop as such:

for (auto i : arr) { 
  // compiler knows that i is an int because arr is of type int[]
  cout << i << " ";
}

While there is little benefit from using auto in this specific case, it is handy when the type is something long and verbose, in which case it is simpler and more compact to just write auto.
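
To make the "long and verbose" case concrete, here is a small sketch that walks over an STL map twice: once spelling out the iterator type in full, and once using auto with a foreach-style loop. The map of names to score lists and both function names are hypothetical, chosen only for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Counts all scores, writing the iterator type out by hand.
int countScoresVerbose(const std::map<std::string, std::vector<int> >& scores) {
  int total = 0;
  for (std::map<std::string, std::vector<int> >::const_iterator it = scores.begin();
       it != scores.end(); ++it) {
    total += static_cast<int>(it->second.size());
  }
  return total;
}

// Same computation; the compiler infers that each entry is a
// std::pair<const std::string, std::vector<int> >.
int countScoresAuto(const std::map<std::string, std::vector<int> >& scores) {
  int total = 0;
  for (const auto& entry : scores) {
    total += static_cast<int>(entry.second.size());
  }
  return total;
}
```

Writing const auto& rather than plain auto keeps the loop from copying each map entry.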

General Reference

You are also likely to want to consult other online references while coding (e.g., looking up library functions, C++ syntax, etc). Especially during this first project, take the time to fill in gaps in your C++ knowledge as opposed to simply trying to 'make it work'. It will be worth it later on!

A comprehensive general reference on C++ and STL classes and functions is cppreference.com (which will often show up in Google results as well). Be careful when consulting this site to make sure you're looking at the right version; e.g., often a function will list two usages, one marked something like "(until C++17)" and the other marked something like "(since C++17)". In this example, the version you should pay attention to is the first version, since we're using C++11 (which predates C++17).

3 - Inverted Index

For the actual coding part of this project, you will write a program in C++ that generates an inverted index of all the words in a list of text files. Briefly, an inverted index is a data structure that maps content to the location(s) of that content. In this context, this simply means a map of words (e.g., 'hello') to the documents containing those words (e.g., 'doc1.txt' and 'doc3.txt').

A typical use case for an inverted index is a search engine - you enter a keyword, and an inverted index could be used to produce all the pages that contain that keyword.

Program Specification

Your inverted index generator (or 'inverter') will run on the command-line and will take exactly one command-line argument: a file that contains a list of filenames (one filename per line). Each of the files named in the input file will contain text that you will use to build your index.

For example, you might have a file named inputs.txt with the following content:

foo1.txt
foo2.txt

Separately, you might have a file named foo1.txt containing the following:

this is a test. cool.

and a file named foo2.txt containing the following:

this is also a test.
boring.

Assuming your compiled executable is named inverter, you could call your program as follows:

./inverter inputs.txt

When run, your inverter should print all of the words from all of the inputs in alphabetical order, with each word followed by the sequentially ordered document numbers in which the word appears. For instance, in the example shown above, foo1.txt is document 0, foo2.txt is document 1, and the correct output would be the following:

a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1

Alphabetical is defined as the order according to ASCII and is case sensitive. So, for example, "The" and "the" are separate words, and "The" comes first. Words may only contain alpha characters and not numbers, spaces, or other special characters. Non-alpha characters should simply split a word into multiple words: for example, "Th3e" is two words, "Th" and "e" (note that the non-alpha character is not part of either word).
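
The ordering rule above matches how std::string compares by default: character by character, by character code. In ASCII, 'T' is 84 while 'a' is 97 and 't' is 116, so every uppercase letter sorts before every lowercase one. The sketch below (with a made-up helper name) shows this using a std::set, which keeps its elements in that order.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the given words in ascending std::string order, duplicates removed.
// The ordering comes from std::string's operator<, which compares by
// character code and is therefore case sensitive.
std::vector<std::string> sortedUnique(const std::vector<std::string>& words) {
  std::set<std::string> ordered(words.begin(), words.end());
  return std::vector<std::string>(ordered.begin(), ordered.end());
}
```

For the input words "the", "apple", "The", "the", this returns "The", "apple", "the".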

Files should be numbered incrementally, starting with 0. Only valid, openable files should be included in the count (i.e., skip invalid files). If the same filename appears multiple times in the input file, you should process those files as normal (i.e., the same as if they had different filenames).

Your program must follow this output specification exactly, and should produce zero additional output under any circumstance. Extraneous output, or output formatted incorrectly (even a single extra space on the end of a line, for instance) will make the autograder mark your solution as incorrect. Your program's behavior even under unusual circumstances (e.g., an input file containing filenames that do not exist) should not differ in any way from what is described here. If you have questions about the specification, please ask!

Coding Rules

In addition to implementing the program specification described above, your code must follow several additional rules:

Globals: You may not declare any global variables in your program (i.e., variables outside of any function).
Pointers: You may not declare any pointer types in your program, with the exception of argv as required by the declaration of the main function. The purpose of this restriction (combined with no globals) is to force you to get comfortable with using reference variables as an alternative to pointers.
Modularity: One way to deal with the restrictions above is to write your entire program in one giant main function (at which point all your top-level variables effectively become globals). Don't do that. Your program should be appropriately modular, with short functions that accomplish well-defined tasks. A function with more than (say) 40 lines is almost certainly too long and should be modularized. As a point of reference, my own implementation consists of 4 functions (including main) and the longest function is just 16 lines long.
Minutiae: Your source file must be named inverter.cc (this is the filename expected by the autograder) and must include your name and Slack code (from Part 1) in a comment at the top of the program.

Failing to follow these coding rules will result in penalties to your hand-graded score (even if your program passes all the autograder tests).

Implementation Tips

Implement the inverted index structure using the C++ Standard Template Library (STL) as a map of sets, as in:

map<string, set<int> > invertedIndex;

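Fun fact: unlike Java's HashMap and HashSet, STL maps and sets are typically implemented as balanced binary search trees (rather than hash tables), which is why iterating over them visits the keys in sorted order. Maps based on hash tables were not added to the standard library until C++11 via the unordered_map class.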
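
For a feel of how a map of sets behaves, consider the following sketch. The helper names addOccurrence and documentsFor are invented for this example and are not a required design. It relies on two standard behaviors: operator[] on a map creates an empty value the first time a key is used, and inserting a duplicate into a set has no effect.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Records that 'word' occurs in document number 'doc'.
void addOccurrence(std::map<std::string, std::set<int> >& index,
                   const std::string& word, int doc) {
  index[word].insert(doc);  // creates an empty set on first use of 'word'
}

// Returns the document numbers recorded for 'word' in ascending order,
// or an empty vector if the word was never added.
std::vector<int> documentsFor(const std::map<std::string, std::set<int> >& index,
                              const std::string& word) {
  auto it = index.find(word);  // find() never inserts, unlike operator[]
  if (it == index.end()) {
    return std::vector<int>();
  }
  return std::vector<int>(it->second.begin(), it->second.end());
}
```

Note that the lookup uses find rather than operator[], since operator[] would silently add a missing word to the map (and is not available on a const map at all).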

You should also use C++ strings and file streams to read the files:

#include <string>
#include <fstream>

Make sure that your program uses an ifstream (input-only) instead of an fstream (input/output) to avoid accidentally modifying the files. Both are declared in the <fstream> header.
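
A minimal sketch of reading a text file line by line with an ifstream follows. The function name readLines and its bool return convention are assumptions made for this example. The point to notice is that a failed open is detected with is_open() and reported to the caller silently, without printing anything.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Appends each line of the file at 'path' to 'lines'.
// Returns false (leaving 'lines' untouched) if the file cannot be opened.
bool readLines(const std::string& path, std::vector<std::string>& lines) {
  std::ifstream in(path);  // input-only stream; cannot modify the file
  if (!in.is_open()) {
    return false;
  }
  std::string line;
  while (std::getline(in, line)) {
    lines.push_back(line);
  }
  return true;  // the stream closes itself when 'in' goes out of scope
}
```

Constructing an ifstream directly from a std::string is itself a C++11 feature; older code had to pass path.c_str().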

The built-in isalpha function (declared in the <cctype> header) is useful in checking whether a character is alphabetic.
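
One subtlety worth knowing: isalpha takes an int, and passing it a negative char value (possible for non-ASCII bytes on platforms where char is signed) is undefined behavior, so the usual idiom is to cast to unsigned char first. The sketch below uses hypothetical helper names and assumes the default "C" locale, in which only A-Z and a-z count as alphabetic.

```cpp
#include <cctype>
#include <string>

// True if c is an alphabetic character. The cast to unsigned char keeps
// the argument in the range that isalpha is defined for.
bool isLetter(char c) {
  return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Counts the alphabetic characters in a string.
int countLetters(const std::string& text) {
  int count = 0;
  for (char c : text) {
    if (isLetter(c)) {
      ++count;
    }
  }
  return count;
}
```

For example, countLetters("Th3e") is 3, because the digit is not alphabetic.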

Remember that your program needs to be robust to errors. Files may be empty, may not exist, or may not be openable. Your program should handle these cases gracefully and with no extra output.

Finally, be careful when passing objects to a function -- remember that by default, passing an object or struct to a function in C++ (or C) results in copying the entire object. You can avoid this copy by using pass-by-reference (or a pointer, but not in this project).
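
The copying pitfall is easy to hit by accident, since the only difference in the source is a single '&'. In this sketch (function names are made up for illustration), the first function receives its own copy of the set, so its insert never reaches the caller. The second operates on the caller's set directly.

```cpp
#include <set>

// Pass-by-value: 'docs' is a full copy of the caller's set, so this insert
// is discarded when the function returns.
void insertIntoCopy(std::set<int> docs, int doc) {
  docs.insert(doc);
}

// Pass-by-reference: 'docs' aliases the caller's set, so the insert sticks
// and nothing is copied.
void insertIntoOriginal(std::set<int>& docs, int doc) {
  docs.insert(doc);
}
```

Both versions compile without warnings under -Wall, which is why a missing '&' often shows up only as mysteriously lost updates or slow code.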

Debugging Tips

Use the gdb debugger (or your IDE's built-in debugger if you are using a regular IDE) to step through the program, and run valgrind to check for memory errors (even if your program isn't crashing). If you need a refresher on using either tool, you can refer to this CSCI 2330 Debugging Mini-Lab (you won't have the actual exercises discussed there, but the basic instructions on using the tools are still valid).

Logistics

To get started, go to Blackboard and browse to this course. Click the Start Projects menu item on the left, then click the "Begin Project 1 - Inverted Index" link, which will take you to GitHub and walk you through initializing your private project repository. When this is done, you will be looking at the GitHub page for your new repository. You can then clone this repository to turing and begin to work.

You will submit your project using the autograding system. Refer to the Autograder Overview for submission instructions and make particular note of the best practices specified there! Remember to commit and push your final work to GitHub as well. It's also a good idea to push to GitHub periodically as you work.

Evaluation

Your program will be graded on (1) correctness, (2) design (which includes following the coding rules), and (3) style. Remember that the autograder will only check correctness, nothing else! For guidance on what constitutes good design and style, see the Coding Design & Style Guide, which lists many common things to look for. Please ask if you have any other questions about design or style issues.