Assigned: | Monday, February 8 |
Due Date: | Tuesday, February 23, 11:59 pm |
Collaboration Policy: | Level 1 |
Group Policy: | Individual |
This warmup project has two primary goals: (1) acquainting you with the tools and infrastructure that we will use throughout the course, and (2) giving you an introduction (or refresher) to C++ programming. Completing this project should ensure that you are sufficiently up to speed with C++ to handle the rest of the projects (where we will dive into proper OS topics)!
Questions, discussions, and announcements in this class will all make use of Slack, which is a channel-based messaging platform. Slack will provide a convenient, efficient way to communicate with me, your classmates, and your team (for group projects).
Slack will be used in preference to all other written forms of communication outside of class, including (and especially) email. Once you have Slack set up, do not send me email; send me a direct message (DM) on Slack instead! Similarly, I will post important announcements and information to Slack instead of sending email. Therefore, it is critical that you are configured to receive Slack notifications so you do not miss important information from me. Follow these steps to get set up:
The Slack workspace contains four public channels (visible to everyone): general (for announcements and general discussion), inclass (for discussion of material that was covered in class), projects (for discussion of programming projects), and random (for anything else off-topic). You can also create your own private channels, which is particularly useful for working in groups. At the moment, creation of new public channels is restricted to admins (i.e., me), but we may alter the public channel lineup later on. If you have a good idea for a new public channel, please let me know!
Now that you are set up in Slack, you are ready to proceed with the rest of the project and get help along the way as needed!
This class will make heavy use of the Linux command-line. You should already have some comfort with the command-line, but if you need or want a refresher, you should go through the Command-Line Unix Crash Course.
We will also be making use of the git version control system (and GitHub) to facilitate collaboration. You may already have some experience with version control, but not necessarily with git specifically. If you are new to git or just want to brush up, check out the Git Essential Training Companion Guide. You can skip the third part (which primarily deals with group collaboration) until the first group project.
You have been provided an account on the class Linux server that will allow you to work on and submit your projects. You are strongly encouraged to do your coding on the class server. The name of the class server is turing.bowdoin.edu (or if you are on-campus, just turing for short). The server provides a full-fledged Linux environment with all the standard development tools (editors, compilers, version control, etc.) preinstalled. If there is any software that you would like to use that is not already installed on the server, please let me know. Access to your server account is by SSH key only (no password access). You should have received your keyfiles from me via email (if not, let me know immediately). For instructions on using your key, refer to the relevant section of the Unix tutorial.
Finally, all projects will make use of an autograding system installed on the class server. Read the Autograder Overview carefully and let me know if you have any questions about the autograder.
Starter code for the projects will be distributed via GitHub. For each project, you (or your group) will have a personal git repository, accessible only by you and me, in which you will complete your work. At the start of each project, an invitation link will be posted to Blackboard that will initialize your personal project repository on GitHub and add any starter files. Once your project repository is initialized, you can clone it to turing and then begin to work. Although you will make your final submission via the autograder, you should also make sure that your final work is committed and pushed to GitHub.
It is recommended to configure your GitHub account to use your SSH key for authentication, which will allow you to interact with GitHub without needing to type your username or password. Refer to the Git tutorial for instructions.
All projects in this course will be written in C++. You should already have some familiarity with C, but may not have ever programmed in C++ before. C++ is (very nearly) a superset of C that adds many features -- most significantly, full support for objects and a library of built-in data structures called the Standard Template Library (STL). Another notable difference is that C++ has a real string class (as opposed to C, which just has char*). C++ has lots (lots!) of features and capabilities that we won't be using at all; modern C++ is a sprawling and complex language (unlike C, which is quite compact). To help with this complexity, we will only be focusing on a few bits and pieces of C++, and otherwise most of the code we write should feel familiar coming from a C background.
While the standard C compiler you are familiar with is called gcc, the standard C++ compiler is called g++. Note that there are several historical C++ standards, which have added many new features over the years. The version of C++ that we will use is C++11 (the 2011 standard). You should not use any C++ features that were added in later versions (e.g., C++14 or C++17). To tell g++ to use C++11, you need to pass the -std=c++11 flag. You should also always pass the -Wall flag to turn on the commonly used compiler warnings. For example: g++ -std=c++11 -Wall -o inverter inverter.cc
As a first step, take a look at this Intro to C++ guide. This guide is targeted towards beginners who know Java but not C, so your existing C knowledge will help quite a bit. If you want a refresher on C fundamentals (pointers, etc), the entire guide should be useful. If you are comfortable with C and just want a quick tutorial on core C++ features, you should focus on Section 10 (classes are new in C++), Section 13 (basic input and output in C++), and Section 14 (the STL). Between your existing knowledge of C and Java, you should be able to understand these examples fairly readily.
C++ has reference variables, which are aliases to some other variable. Section 8 of the above guide contains one example of using reference variables, but they deserve special attention on account of their frequent use as function parameters.
Reference variables are used when you want to pass a reference to a value/struct/object (called pass-by-reference) rather than the standard behavior of passing a copy of the value/struct/object (called pass-by-value). Passing a reference is more efficient (since nothing is copied) and also allows you to modify the original argument within the called function. Achieving this in a plain C program requires passing a pointer (which is still pass-by-value, but you're only copying a single pointer value), whereas in C++ we most often use a reference variable instead of a pointer. Here is a quick example of pass-by-value using a pointer vs. pass-by-reference; the pointer example is nearly identical to a C program (apart from the use of cout rather than printf), whereas the reference example has no analog in C. Here is a more technical explanation of reference variables, which contains more information on syntax and more detailed examples.
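To make the contrast concrete, here is a minimal sketch of my own (it is not the linked example, and all of the function names are hypothetical). It shows the same increment written with a plain copy, with a pointer, and with a reference:

```cpp
// Pass-by-value: `value` is a copy, so the caller's variable is untouched.
void incrementCopy(int value) {
    value++;
}

// Pointer version (also valid C): the pointer itself is passed by value,
// and the function must dereference it to reach the caller's variable.
void incrementViaPointer(int* value) {
    (*value)++;
}

// Reference version (C++ only): `value` is an alias for the caller's variable.
void incrementViaReference(int& value) {
    value++;
}

// Applies all three functions to one local variable and returns the result.
int runDemo() {
    int counter = 0;
    incrementCopy(counter);          // counter is still 0
    incrementViaPointer(&counter);   // caller passes an address; counter is 1
    incrementViaReference(counter);  // caller passes the variable; counter is 2
    return counter;
}
```

Note that the call site for the reference version looks exactly like the pass-by-value call; only the function signature tells you that the argument may be modified.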
In general, pointers tend to be used less often in C++ than in C as a consequence of having reference variables (which are safer and more convenient in most cases). You can also mark reference variables const, which prevents the referenced value from being modified through them. (Pointers can be protected in a similar way by declaring them as pointers to const, e.g., const int*.)
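As an illustrative sketch of a const reference parameter (the function name totalLength is hypothetical and chosen only for this example), the function below reads a vector of strings without copying it and without being able to change it:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// `words` is a read-only alias for the caller's vector: nothing is copied,
// and any attempt to modify it here (e.g., words.clear()) would be rejected
// by the compiler.
std::size_t totalLength(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& word : words) {
        total += word.length();
    }
    return total;
}
```

A const reference is a reasonable default for any object parameter that a function only needs to read.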
The above C++ guide predates the C++11 standard, so it does not use any more modern language features. C++11 introduced a pile of new features, most of which we won't need. Two particular additions worth noting, however, are foreach-style loops and the auto type keyword. Foreach-style loops use the same syntax as in Java, e.g., if arr is an array of integers, you could write the following to print the array:
for (int i : arr) {
    cout << i << " ";
}
The auto keyword can be used as a type when you want the compiler to infer the type from the rest of the program. For example, you could rewrite the above loop as follows:
for (auto i : arr) {
    // compiler knows that i is an int because arr is of type int[]
    cout << i << " ";
}
While there is little benefit from using auto in this specific case, it is handy when the type is something long and verbose, in which case it is simpler and more compact to just write auto.
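For a case where auto does pay off, consider iterating over an STL container with an explicit iterator. The sketch below is only an illustration; the function name and the choice of container are my own assumptions:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Counts how many keys in `table` map to a vector containing `target`.
int countKeysContaining(const std::map<std::string, std::vector<int>>& table,
                        int target) {
    int matches = 0;
    // Without auto, the type of `it` would have to be spelled out as
    // std::map<std::string, std::vector<int>>::const_iterator.
    for (auto it = table.begin(); it != table.end(); ++it) {
        const std::vector<int>& values = it->second;
        if (std::find(values.begin(), values.end(), target) != values.end()) {
            matches++;
        }
    }
    return matches;
}
```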
You are also likely to want to consult other online references while coding (e.g., looking up library functions, C++ syntax, etc). Especially during this first project, take the time to fill in gaps in your C++ knowledge as opposed to simply trying to 'make it work'. It will be worth it later on!
A comprehensive general reference on C++ and STL classes and functions is cppreference.com (which will often show up in Google results as well). Be careful when consulting this site to make sure you're looking at the right version; e.g., often a function will list two usages, one marked something like "(until C++17)" and the other marked something like "(since C++17)". In this example, the version you should pay attention to is the first version, since we're using C++11 (which predates C++17).
For the actual coding part of this project, you will write a program in C++ that generates an inverted index of all the words in a list of text files. Briefly, an inverted index is a data structure that maps content to the location(s) of that content. In this context, this simply means a map of words (e.g., 'hello') to the documents containing those words (e.g., 'doc1.txt' and 'doc3.txt').
A typical use case for an inverted index is a search engine - you enter a keyword, and an inverted index could be used to produce all the pages that contain that keyword.
Your inverted index generator (or 'inverter') will run on the command-line and will take exactly one command-line argument: a file that contains a list of filenames (one filename per line). Each of the files named in the input file will contain text that you will use to build your index.
For example, you might have a file named inputs.txt with the following content:
foo1.txt
foo2.txt
Separately, you might have a file named foo1.txt containing the following:
this is a test. cool.
and a file named foo2.txt containing the following:
this is also a test. boring.
Assuming your compiled executable is named inverter, you could call your program as follows:
./inverter inputs.txt
When run, your inverter should print all of the words from all of the inputs in alphabetical order, with each word followed by the sequentially ordered document numbers in which the word appears. For instance, in the example shown above, foo1.txt is document 0, foo2.txt is document 1, and the correct output would be the following:
a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1
Alphabetical is defined as the order according to ASCII and is case sensitive. So, for example, "The" and "the" are separate words, and "The" comes first. Words may only contain alpha characters and not numbers, spaces, or other special characters. Non-alpha characters should simply split a word into multiple words: for example, "Th3e" is two words, "Th" and "e" (note that the non-alpha character is not part of either word).
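The splitting rule can be sketched as follows. This is one possible reading of the rule, not a required design, and the function name splitIntoWords is hypothetical:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Splits `text` into maximal runs of alphabetic characters. Every other
// character acts as a separator and is not part of any word.
std::vector<std::string> splitIntoWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        // The cast to unsigned char avoids undefined behavior when isalpha
        // is given a negative char value.
        if (std::isalpha(static_cast<unsigned char>(c))) {
            current += c;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    // Keep the final word if the text does not end with a separator.
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

With this sketch, splitIntoWords("Th3e") yields the two words "Th" and "e". Separately, comparing C++ strings with the < operator already follows ASCII order, so "The" compares as less than "the".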
Files should be numbered incrementally, starting with 0. Only valid, openable files should be included in the count (i.e., skip invalid files). If the same filename appears multiple times in the input file, you should process those files as normal (i.e., the same as if they had different filenames).
Your program must follow this output specification exactly, and should produce zero additional output under any circumstance. Extraneous output, or output formatted incorrectly (even a single extra space on the end of a line, for instance) will make the autograder mark your solution as incorrect. Your program's behavior even under unusual circumstances (e.g., an input file containing filenames that do not exist) should not differ in any way from what is described here. If you have questions about the specification, please ask!
In addition to implementing the program specification described above, your code must follow several additional rules:
[...] argv as required by the declaration of the main function. The purpose of this restriction (combined with no globals) is to force you to get comfortable with using reference variables as an alternative to pointers.
[...] main function (at which point all your top-level variables effectively become globals). Don't do that. Your program should be appropriately modular, with short functions that accomplish well-defined tasks. A function with more than (say) 40 lines is almost certainly too long and should be modularized. As a point of reference, my own implementation consists of 4 functions (including main) and the longest function is just 16 lines long.
[...] inverter.cc (this is the filename expected by the autograder) and must include your name and Slack code (from Part 1) in a comment at the top of the program.
Failing to follow these coding rules will result in penalties to your hand-graded score (even if your program passes all the autograder tests).
Implement the inverted index structure using the C++ Standard Template Library (STL) as a map of sets, as in:
map<string, set<int> > invertedIndex;
Fun fact: unlike in Java, STL maps and sets are implemented as binary search trees (rather than hash tables). Maps based on hash tables were not added to the STL until C++11 via the unordered_map class.
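Because both containers are tree-based, iterating over them visits keys and elements in sorted order. The sketch below illustrates how a map of sets behaves; the helper names addOccurrence and describe are hypothetical, and the text layout it builds is for illustration only (the output specification above is what the autograder checks):

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Records that `word` occurs in document number `doc`. operator[] creates an
// empty set the first time a word is seen, and inserting a number that is
// already in the set has no effect.
void addOccurrence(std::map<std::string, std::set<int>>& index,
                   const std::string& word, int doc) {
    index[word].insert(doc);
}

// Builds a text rendering of the index. The map yields its keys in sorted
// (ASCII) order and each set yields its numbers in ascending order.
std::string describe(const std::map<std::string, std::set<int>>& index) {
    std::ostringstream out;
    for (const auto& entry : index) {
        out << entry.first << ":";
        for (int doc : entry.second) {
            out << " " << doc;
        }
        out << "\n";
    }
    return out.str();
}
```

Note that both helpers take the map by reference, so the (potentially large) structure is never copied.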
You should also use C++ strings and file streams to read the files:
#include <string>
#include <fstream>
Make sure that your program uses an ifstream (input-only) instead of an fstream (input/output) to avoid accidentally modifying the files. Both are declared in the <fstream> header.
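As a rough sketch of reading a file with an ifstream (the function name readLines is hypothetical, and this is just one way to structure the read), the following collects the lines of a file and reports whether the file could be opened at all:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Appends every line of the file at `path` to `lines`. Returns false, leaving
// `lines` untouched, if the file cannot be opened.
bool readLines(const std::string& path, std::vector<std::string>& lines) {
    std::ifstream input(path);  // opened for reading only
    if (!input.is_open()) {
        return false;
    }
    std::string line;
    // getline returns the stream, which evaluates to false at end of file.
    while (std::getline(input, line)) {
        lines.push_back(line);
    }
    return true;
}
```

The stream closes itself when `input` goes out of scope, so no explicit close call is needed. An empty file simply produces zero lines.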
The built-in isalpha function (declared in <cctype>) is useful in checking whether a character is alphabetic.
Remember that your program needs to be robust to errors. Files may be empty, etc. Your program should handle these cases gracefully and with no extra output.
Finally, be careful when passing objects to a function -- remember that by default, passing an object or struct to a function in C++ (or C) results in copying the entire object. You can avoid this copy by using pass-by-reference (or a pointer, but not in this project).
Use the gdb debugger (or your IDE's built-in debugger if you are using a regular IDE) to step through the program, and run valgrind to check for memory errors (even if your program isn't crashing). If you need a refresher on using either tool, you can refer to this CSCI 2330 Debugging Mini-Lab (you won't have the actual exercises discussed there, but the basic instructions on using the tools are still valid).
To get started, go to Blackboard and browse to this course. Click the Start Projects menu item on the left, then click the "Begin Project 1 - Inverted Index" link, which will take you to GitHub and walk you through initializing your private project repository. When this is done, you will be looking at the GitHub page for your new repository. You can then clone this repository to turing and begin to work.
You will submit your project using the autograding system. Refer to the Autograder Overview for submission instructions and make particular note of the best practices specified there! Remember to commit and push your final work to GitHub as well. It's also a good idea to push to GitHub periodically as you work.
Your program will be graded on (1) correctness, (2) design (which includes following the coding rules), and (3) style. Remember that the autograder will only check correctness, nothing else! For guidance on what constitutes good design and style, see the Coding Design & Style Guide, which lists many common things to look for. Please ask if you have any other questions about design or style issues.