CSCI 3310
Operating Systems

Bowdoin College
Fall 2022
Instructor: Sean Barker

Project 1 - Inverted Index

Assigned: Wednesday, August 31.
Due Date: Tuesday, September 13, 11:59 pm.
Collaboration Policy: Level 1
Group Policy: Individual

This warmup project has two primary goals: (1) acquainting you with the tools and infrastructure that we will use throughout the course, and (2) giving you an introduction (or refresher) to C++ programming. Completing this project should ensure that you are sufficiently up to speed with C++ to handle the rest of the projects (where we will dive into proper OS topics)!

0 - It's in the Syllabus

First, read the class syllabus so that you are aware of important course policies. If you have questions or need clarifications on anything in the syllabus, please let me know! Going forward, I will assume that everyone has read and is familiar with the policies and procedures spelled out in the syllabus.

Somewhere within the syllabus is a code that you'll need to submit in the next part of the lab. Make a note of this code (and note that your code is unique).

1 - Tools and Infrastructure

Slack Setup

Questions, discussions, and announcements in this class will all make use of Slack, which is a channel-based messaging platform. Slack will provide a convenient, efficient way to communicate with me, your classmates, and your team (for group projects).

Slack will be used in preference to all other written forms of communication outside of class, including (and especially) email. Once you have Slack set up, do not send me email; send me a direct message (DM) on Slack instead! Similarly, I will post important announcements and information to Slack instead of sending email. Therefore, it is critical that you are configured to receive Slack notifications so you do not miss important information from me. Follow these steps to get set up:

  1. First, watch this 1-minute video to get a quick overview of how Slack works.
  2. Join the CSCI 3310 Slack using the invitation link that was sent to your Bowdoin email. If you haven't received an invitation email, let me know. Once you have joined, you will be able to access the Slack using the regular CSCI 3310 Slack link (which is also posted to Canvas).
  3. Download the Slack desktop app. While you can access Slack via the web interface, the desktop app is more seamless and may encourage you to check in more regularly (like a dedicated email client). You can also install the Slack mobile app if desired.
  4. Configure your Slack notifications. As mentioned above, I will use Slack instead of email to make important announcements, so it is critical that you are notified of such announcements. Set your notifications to either "All new messages" or "Direct messages, mentions & keywords". At the bottom of the Notifications preferences, under "When I'm not active on desktop", you may want to check the "Send me email notifications" box, particularly if you might otherwise not be signed into Slack on a regular basis (you can always change these settings later on).
  5. From within Slack, send me a DM containing your code from the course syllabus so I know that you're set up.

The Slack contains four public channels (visible to everyone): general (for announcements and general discussion), inclass (for discussion of in-class material), projects (for discussion of programming projects), and random (for anything off-topic). You can also create (private) DM threads with multiple participants. To post anonymously in a channel, just type your message like so: /anonymous This is an anonymous message (but note that you cannot reply anonymously inside a message thread). Messages can also be edited or deleted after posting.

Now that you are set up in Slack, you are ready to proceed with the rest of the project and get help along the way as needed!

Command-Line Environment

This class will make heavy use of the Linux command line. You should already have some comfort with the command line, but if you need or want a refresher, you should go through the Command-Line Unix Crash Course.

Version Control

We will also be making use of the git version control system (and GitHub) to facilitate collaboration. You may already have some experience with version control, but not necessarily with git specifically. If you are new to git or just want to brush up, check out the Git Essential Training Companion Guide. You can skip the third part (which primarily deals with group collaboration) until the first group project.

Class Server

You have been provided an account on the class Linux server that will allow you to work on and submit your projects. You are strongly encouraged to do your coding on the class server. The name of the class server is hopper.bowdoin.edu (or if you are on-campus, just hopper for short). The server provides a full-fledged Linux environment with all the standard development tools (editors, compilers, version control, etc) preinstalled. If there is any software that you would like to use that is not already installed on the server, please let me know and I can probably install it. Access to your server account is by SSH key only (no password access). You should have received your keyfiles from me via email (if not, let me know). For instructions on using your key, refer to the relevant section of the Unix tutorial.

Project Autograder

Finally, all projects will make use of an autograding system installed on the class server. Read the Autograder Overview carefully and let me know if you have any questions about the autograder.

General Logistics

Starter code for the projects will be distributed via GitHub. For each project, you (or your group) will have a personal git repository, accessible only to you and me, in which you will complete your work. At the start of each project, an invitation link will be posted to the Slack that will initialize your personal project repository on GitHub and add any starter files. Once your project repository is initialized, you can clone it to hopper and then begin to work. Although you will make your final submission via the autograder, you should also make sure that your final work is committed and pushed to GitHub.

It is recommended to configure your GitHub account to use your SSH key for authentication, which will allow you to interact with GitHub without needing to type your username or password. Refer to the Git tutorial for instructions.

2 - C++ Overview

All projects in this course will be written in C++. You should already be familiar with C, but you may not have ever programmed in C++ before. C++ is, for practical purposes, a superset of C that adds many features -- most significantly, full support for objects and a library of built-in data structures called the Standard Template Library (STL). Another notable difference is that, like Java, C++ has a real string class (as opposed to C, which just has char*) and a standard bool type for boolean values. C++ has many features and capabilities that we won't be using at all; modern C++ is a sprawling and complex language (unlike C, which is quite compact). To help with this complexity, we will only be focusing on a few bits and pieces of C++, and otherwise most of the code we write should feel familiar coming from a C background.

Compilation

While the standard C compiler you are familiar with is called gcc, the standard C++ compiler is called g++. Note that there are several historical C++ standards, which have added many new features over the years. The version of C++ that we will use is C++11 (the 2011 standard). You should not use any C++ features that were added in later versions (e.g., C++14 or C++17). To tell g++ to use C++11, you need to pass the -std=c++11 flag. You should also always pass the -Wall flag, which turns on the most commonly useful compiler warnings (despite its name, it does not enable every available warning). The Makefile included in the project starter code will already include these flags.

Language Primer

As a first step, take a look at this (elderly but fairly compact) Intro to C++ guide. This guide is targeted towards beginners who know Java but not C, so your existing C knowledge will help quite a bit. If you want a refresher on C fundamentals (pointers, etc), the entire guide should be useful. If you are comfortable with C and just want a quick tutorial on core C++ features, you should focus on Section 10 (classes are new in C++), Section 13 (basic input and output in C++), and Section 14 (the STL). Between your existing knowledge of C and Java, you should be able to understand these examples fairly readily.

Reference Variables

C++ has reference variables, which are aliases to some other variable. Section 8 of the above guide contains one example of using reference variables, but they deserve special attention on account of their frequent use as function parameters.

Reference variables are used when you want to pass a reference (sort of like a pointer, but without using an explicit pointer type) to a value/struct/object (called pass-by-reference) rather than the standard behavior of passing a copy of the value/struct/object (called pass-by-value). Passing a reference is more efficient (since nothing is copied) and also allows you to modify the original argument within the called function. Achieving the same functionality in a plain C program requires passing a pointer (which is still pass-by-value, wherein you're making a copy of the pointer), whereas in C++ we most often use a reference variable instead of a pointer. Here is a quick example of pass-by-value using a pointer vs. pass-by-reference; the pointer example is nearly identical to a C program (apart from the use of cout rather than printf), whereas the reference example has no analog in C. Here is a more technical explanation of reference variables, which contains more information on syntax and more detailed examples.
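As a rough illustration of the difference described above, here is a minimal sketch. The function names (incrementCopy, incrementViaPointer, incrementViaReference, demoIncrement) are made up for this example and are not part of the starter code.

```cpp
#include <iostream>
using namespace std;

// Plain pass-by-value: only the local copy changes.
void incrementCopy(int value) {
  value++;
}

// Pass-by-value of a pointer (the C approach): the pointer is copied,
// the caller passes an address, and the callee dereferences it.
void incrementViaPointer(int* value) {
  (*value)++;
}

// Pass-by-reference: 'value' is an alias for the caller's variable,
// so no address-of or dereference syntax is needed.
void incrementViaReference(int& value) {
  value++;
}

// Hypothetical driver that calls all three and returns the final value.
int demoIncrement() {
  int counter = 0;
  incrementCopy(counter);          // counter is still 0
  incrementViaPointer(&counter);   // counter is now 1
  incrementViaReference(counter);  // counter is now 2
  cout << counter << endl;
  return counter;
}
```

Note that the call site for the reference version looks exactly like the pass-by-value call; only the function signature reveals that the argument may be modified.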

In general, explicit pointers tend to be less often used in C++ than in C as a consequence of having reference variables (which are safer and more convenient in most cases). You can also mark a reference const, which prevents the referenced value from being modified through that reference while still avoiding a copy (a pointer can be given similar protection by declaring it as a pointer to const, e.g., const int*, but const references are the more common idiom for function parameters in C++).
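A const reference parameter might look like the following sketch, in which totalLength is a hypothetical helper written only for illustration:

```cpp
#include <string>
#include <vector>
using namespace std;

// Taking the vector by const reference avoids copying it, and the compiler
// rejects any attempt to modify it inside this function.
size_t totalLength(const vector<string>& words) {
  size_t total = 0;
  for (const string& word : words) {  // each element is also read-only here
    total += word.size();
  }
  // words.push_back("oops");  // would not compile: words is const
  return total;
}
```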

C++11 Features

The above C++ guide predates the C++11 standard, so it does not use any more modern language features. C++11 introduced many new features, most of which we won't need. Two particular additions worth noting, however, are foreach-style loops (formally called range-based for loops) and the auto type keyword. Foreach-style loops use the same syntax as in Java, e.g., if arr is an array of integers, you could write the following to print the array:

for (int i : arr) {
  cout << i << " ";
}

The auto keyword can be used as a type when you want the compiler to infer the type from the rest of the program. For example, you could rewrite the above loop as such:

for (auto i : arr) { 
  // compiler knows that i is an int because arr is of type int[]
  cout << i << " ";
}

While there is little benefit from using auto in this specific case, it is handy when the type is something long and verbose, in which case it is simpler and more compact to just write auto. However, don't go overboard with auto; you should only use it when the alternative is writing out a long, cumbersome type (as often happens with iterators). For simple, basic types, your code will be clearer if you just specify the appropriate type as usual.
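To illustrate the iterator case, here is a small sketch that walks a map of sets (the same container type suggested in the implementation tips of this handout). The function name countPostings is an assumption made for this example.

```cpp
#include <map>
#include <set>
#include <string>
using namespace std;

// Counts the total number of (word, document) pairs stored in the index.
int countPostings(const map<string, set<int> >& index) {
  int total = 0;
  // Without auto, the iterator type would have to be written out in full:
  //   map<string, set<int> >::const_iterator it = index.begin();
  for (auto it = index.begin(); it != index.end(); ++it) {
    total += static_cast<int>(it->second.size());
  }
  return total;
}
```

Here auto saves a long type name without hiding anything important, whereas the int accumulator is clearer written out explicitly.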

General Reference

You are also likely to want to consult other online references while coding (e.g., looking up library functions, C++ syntax, etc). Especially during this first project, take the time to fill in gaps in your C++ knowledge as opposed to simply trying to 'make it work'. It will be worth it later on!

A comprehensive general reference on C++ and STL classes and functions is cppreference.com (which will often show up in Google results as well). Be careful when consulting this site to make sure you're looking at the right version; e.g., often a function will list two usages, one marked something like "(until C++17)" and the other marked something like "(since C++17)". In this example, the version you should pay attention to is the first version, since we're using C++11 (which predates C++17).

3 - Inverted Index

For the actual coding part of this project, you will write a program in C++ that generates an inverted index of all the words in a list of text files. Briefly, an inverted index is a data structure that maps content to the location(s) of that content. In this context, this simply means a map of words (e.g., 'hello') to the documents containing those words (e.g., 'doc1.txt' and 'doc3.txt').

A typical use case for an inverted index is a search engine - you enter a keyword, and an inverted index could be used to produce all the pages that contain that keyword.

Program Specification

Your inverted index generator (or 'inverter') will run on the command line and will take exactly one command-line argument: a file that contains a list of filenames (one filename per line). Each of the files named in the input file will contain text that you will use to build your index.

For example, you might have a file named inputs.txt with the following content:

foo1.txt
foo2.txt

Separately, you might have a file named foo1.txt containing the following:

this is a test. cool.

and a file named foo2.txt containing the following:

this is also a test.
boring.

Assuming your compiled executable is named inverter, you could call your program as follows:

./inverter inputs.txt

When run, your inverter should print all of the words from all of the inputs in alphabetical order, with each word followed by the sequentially ordered document numbers in which the word appears. For instance, in the example shown above, foo1.txt is document 0, foo2.txt is document 1, and the correct output would be the following:

a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1

Alphabetical is defined as the order according to ASCII and is case sensitive. So, for example, "The" and "the" are separate words, and "The" comes first. Words may only contain alpha characters and not numbers, spaces, or other special characters. Non-alpha characters should simply split a word into multiple words: for example, "Th3e" is two words, "Th" and "e" (note that the non-alpha character is not part of either word).
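One possible way to apply the word-splitting rule is sketched below. This is only an illustration under the rules stated above, not a required design; the function name splitWords is hypothetical. The cast to unsigned char is there because passing a negative char value to isalpha is undefined behavior.

```cpp
#include <cctype>
#include <string>
#include <vector>
using namespace std;

// Splits text into maximal runs of alphabetic characters.
// Every non-alpha character ends the current word and is discarded.
vector<string> splitWords(const string& text) {
  vector<string> words;
  string current;
  for (char c : text) {
    if (isalpha(static_cast<unsigned char>(c))) {
      current += c;
    } else if (!current.empty()) {
      words.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) {  // the text may end in the middle of a word
    words.push_back(current);
  }
  return words;
}
```

With this sketch, splitWords("Th3e") yields the two words "Th" and "e". Separately, comparing two C++ strings with < compares their character codes, so string("The") < string("the") is true in ASCII.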

Files should be numbered incrementally, starting with 0. Only valid, openable files should be included in the count (i.e., skip invalid files). If the same filename appears multiple times in the input file, you should process those files as normal (i.e., the same as if they had different filenames).

If your program is passed an invalid filename as the command-line argument, no output should be produced.

Your program must follow this output specification exactly, and should produce zero additional output under any circumstance. Extraneous output, or output formatted incorrectly (even a single extra space on the end of a line, for instance) will make the autograder mark your solution as incorrect. Your program's behavior even under unusual circumstances (e.g., an input file containing filenames that do not exist) should not differ in any way from what is described here. If you have questions about the specification, please ask!

Coding Rules

In addition to implementing the program specification described above, your code must follow several additional rules:

Failing to follow these coding rules will result in penalties to your hand-graded score (even if your program passes all the autograder tests).

Implementation Tips

Implement the inverted index structure using the C++ Standard Template Library (STL) as a map of sets, as in:

map<string, set<int> > invertedIndex;

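Fun fact: unlike Java's HashMap and HashSet, STL maps and sets are typically implemented as balanced binary search trees (rather than hash tables), which is why iterating over them visits their elements in sorted order. Maps based on hash tables were not added to the standard library until C++11 via the unordered_map class.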
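Because map and set keep their elements sorted, iterating over a map of sets visits words in ascending string order and document numbers in ascending numeric order, with duplicate insertions ignored. The following sketch shows that behavior; formatIndex is a hypothetical helper that builds a string rather than printing, purely so the result is easy to inspect.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
using namespace std;

// Builds one line per word: the word, a colon, then each document number
// preceded by a single space (so there is no trailing space).
string formatIndex(const map<string, set<int> >& index) {
  ostringstream out;
  for (const auto& entry : index) {  // entry.first is the word
    out << entry.first << ":";
    for (int doc : entry.second) {   // entry.second is the set of documents
      out << " " << doc;
    }
    out << "\n";
  }
  return out.str();
}
```

Inserting into the index is a one-liner such as invertedIndex[word].insert(docNumber), since operator[] creates an empty set the first time a word is seen.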

You should also use C++ strings and file streams to read the files:

#include <string>
#include <fstream>

Make sure that your program uses an ifstream (input-only) instead of an fstream (input/output) to avoid accidentally modifying the files. Both are declared in the <fstream> header.

The built-in isalpha function (declared in the <cctype> header) is useful in checking whether a character is alphabetic.
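As a sketch of reading a file with an ifstream, the hypothetical helper below (readLines is not part of the starter code) reads a file line by line and reports whether the file could be opened, producing no output either way.

```cpp
#include <fstream>
#include <string>
#include <vector>
using namespace std;

// Appends every line of the named file to 'lines'.
// Returns false (leaving 'lines' untouched) if the file cannot be opened.
bool readLines(const string& filename, vector<string>& lines) {
  ifstream input(filename);  // C++11 accepts a std::string filename here
  if (!input.is_open()) {
    return false;
  }
  string line;
  while (getline(input, line)) {
    lines.push_back(line);
  }
  return true;  // the ifstream closes itself when it goes out of scope
}
```

Checking is_open() before reading is one simple way to detect filenames that do not exist or cannot be opened.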

Remember that your program needs to be robust to errors. Files may be empty, etc. Your program should handle these cases gracefully and with no extra output.

Finally, be careful when passing objects to a function -- remember that by default, passing an object or struct to a function in C++ (or C) results in copying the entire object. You can avoid this copy by using pass-by-reference (or a pointer, but not in this project).

Debugging Tips

For any basic print-based debugging, don't use cout for debugging output! All regular output from your program is passed to the autograder, so any extra output that your program produces beyond the project specification (such as diagnostic info) will cause your program to fail test cases. Rather than having to comment out or otherwise disable all such output before submitting to the autograder, send any debugging output to cerr ("standard error") rather than cout. Output that goes to cerr will appear as normal when you run the program at the terminal, but will be ignored by the autograder when checking your output.
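A minimal sketch of this pattern follows. The helper names debugLog and addDocument are invented for this example, as is the "[debug]" prefix.

```cpp
#include <iostream>
#include <set>
#include <string>
using namespace std;

// Diagnostic messages go to standard error, never to standard output.
void debugLog(const string& message) {
  cerr << "[debug] " << message << endl;
}

// Records a document number and returns how many distinct documents
// are now in the set, logging what happened along the way.
int addDocument(set<int>& docs, int docNumber) {
  docs.insert(docNumber);
  debugLog("added document " + to_string(docNumber) +
           ", set size is now " + to_string(docs.size()));
  return static_cast<int>(docs.size());
}
```

When running at the terminal, you can hide the diagnostic messages by redirecting standard error, e.g., ./inverter inputs.txt 2>/dev/null, which leaves only the regular output visible.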

For more elegant debugging, use the gdb debugger (or your IDE's built-in debugger if you are using a regular IDE) to step through the program. It's also a good idea to run valgrind to check for memory errors (even if your program isn't crashing). If you need a refresher on using gdb or valgrind, you can refer to the CSCI 2330 Debugging Mini-Lab (you won't have the actual exercises discussed there, but the basic information on using the tools is still instructive).

Logistics

To get started, initialize your private GitHub repository for the project using the invitation link posted to the Slack. This link will take you to GitHub and walk you through initializing your private project repository. When this is done, you will be looking at the GitHub page for your new repository. You can then clone this repository to hopper and begin to work.

You will submit your project using the autograding system. Refer to the Autograder Overview for submission instructions and make particular note of the best practices specified there! Remember to commit and push your final work to GitHub as well. It's also a good idea to push to GitHub periodically as you work.

Evaluation

Your program will be graded on (1) correctness, (2) design (which includes following the coding rules), and (3) style. Remember that the autograder will only check correctness, nothing else! For guidance on what constitutes good design and style, see the Coding Design & Style Guide, which lists many common things to look for. Please ask if you have any other questions about design or style issues.