Project 1 - Web Server

Release Date:	Wednesday, September 10.
Acceptance Deadline:	Sunday, September 14, 11:59 pm.
Due Date:	Sunday, September 28, 11:59 pm.
Collaboration Policy:	Level 1
Group Policy:	Groups of 2 or 3

In this project, you will implement a basic web server in C using low-level networking primitives. Building your server will teach you the basics of network programming, client/server architectures, and concurrency in networked applications.

This project should be done in teams of two or three. However, remember that the objective of working in a team is to work as a team. In other words, rather than approaching the project by splitting up the work, you should plan for all team members to work on all parts of the project.

Server Specification

Your task is to write a simple web server capable of servicing remote clients by sending them requested files from the local machine. Communication between a client and the server is defined by the Hypertext Transfer Protocol (HTTP). As such, your server will need to understand HTTP requests sent by clients and send HTTP-formatted responses back to clients.

Your server must support (at least) the following subset of functionality of both the HTTP 1.0 and the HTTP 1.1 standards:

You must support the HTTP GET method. You do not need to support any other methods, such as HEAD or POST.
You must support status codes 200, 400, 403, and 404 in your server responses.
You must support the Content-Type, Content-Length, and Date headers in your server responses.
You must support file types HTML, TXT, JPG, PNG, and GIF.
Specific functionality that you do not need to support includes chunked requests, absolute URLs, 100 Continue responses, or other headers such as If-Modified-Since and If-Unmodified-Since.

To implement the required functionality, the only request headers you need to be concerned with are Host and Connection, and the only response headers you need to be concerned with are Date, Content-Length, and Content-Type. However, feel free to extend your server to provide any functionality not required by the base specification.

Your server program must be written in C on Linux and must accept the following two command-line arguments:

-p [num] to set the port on which the server listens.
-r [path] to set the directory out of which all files are served, called the document root.

For example, you should be able to start the server using a command like the following:

./server -p 8888 -r somedir/serverfiles

Unless your document root starts with a /, it is a relative path, and therefore is interpreted relative to the current working directory. For example, if you are in /home/jdoe/proj1 and execute the above command (which specifies the relative path somedir/serverfiles), then files would be served out of /home/jdoe/proj1/somedir/serverfiles. However, if you instead specified an absolute path like /somedir/serverfiles, then files would be served out of /somedir/serverfiles without respect to the current working directory.

If either command-line option is omitted, the program should exit with an error message. Command-line options may appear in arbitrary order; therefore, you should use getopt for parsing arguments.

As in most web servers, requests for a directory without a filename (e.g., GET / or GET /catpictures/) should default to fetching index.html inside the specified directory (for example, docroot/catpictures/index.html if the document root is docroot). In other words, index.html is the default filename if no explicit filename is provided.

warning IMPORTANT: Do not allow files to be accessed outside of the document root! The simplest way someone could attempt to do so is by sending a request like GET ../myprivatefile.txt, which navigates out of the document root (via ..) and then tries to access a file elsewhere. A server that permits such access would allow a client to potentially access any file on the machine that's readable by your user account. A quick-and-dirty way to prevent such access is by disallowing file paths that include ... While a full-fledged web server would permit paths including .. as long as the path remains inside the document root, it is perfectly sufficient for you to simply treat file paths containing .. as unauthorized (i.e., HTTP code 403), and I will not test your server with legitimate queries for such paths.

A good tutorial on the essentials of the HTTP protocol is linked from the resources at the bottom of this writeup.

Starter Files

Your Git repository includes the following provided starter files:

server.c: the source file in which you should write your server.
Makefile: a Makefile that will let you compile the server by typing make.
testroot: a directory containing a simple web site that you can use as a document root during testing.

The only file you must modify is server.c. You are welcome to modify the test document root or create other document roots to use during testing. Note that the provided Makefile is configured to compile with all errors turned on and warnings counted as errors; this is done intentionally to ensure that you fix compiler warnings rather than ignore them. Fixing warnings will often teach you something about programming even if the warning in question doesn't represent a bug in the program!

Testing the Server

There are several ways you can test your server. The first is to simply access your server in a browser. For example, if your server is running on port 8888, then you could type http://hopper.bowdoin.edu:8888/something.html into your web browser to access something.html inside the document root on the server. However, testing with a browser is not recommended during early development and testing, as browsers will often simply hang or display nothing if your server isn't responding correctly. A more effective initial testing approach is to use telnet, which is a tool for sending arbitrarily-formatted text messages to any network server. For example, below is an example of connecting to google.com on port 80 and then sending an HTTP request for the file index.html:

$ telnet google.com 80
GET /index.html HTTP/1.0

Note that in the above command, there must be two carriage returns (i.e., blank lines) after the "GET" line in order to complete the command. The response to this request will be the HTTP-formatted response from the server. Using telnet will be more initially reliable than a browser, as you will be able to verify that you are getting any response back at all without also having to worry about whether the response is compliant with HTTP.

As an intermediate step between telnet and a full-blown browser, you can also use the wget or curl utilities. These utilities provide command-line HTTP clients: wget will send HTTP/1.0 requests, while curl will send HTTP/1.1 requests (though can be configured to send HTTP/1.0 requests as well). Consult the man pages for details on proper usage.

A recommended testing strategy is to use telnet initially, then move to wget and/or curl, then finally move to a full-blown browser once things seem to be working. The provided sample document root will be useful in testing that HTTP 1.1 is working properly, as the pages with embedded images will be requested through a single connection when accessed via a browser.

warning IMPORTANT: Do not leave your server running when you are not actively testing! Whenever you are done testing, make sure to terminate your server (Control-C), especially before logging off the server. Leaving a server running for long periods will occupy port numbers and is a potential security risk.

Implementation Advice

This section contains tips on implementing various parts of the server.

Parsing Command-Line Arguments

You should use the getopt library function for parsing arguments. The basic idea is that getopt is given a string that specifies all of the possible command-line arguments, some of which may take associated values (which would be both -p and -r here). The string passed to getopt specifies arguments taking a value including a colon : after the associated character (so here, you would want to use p:r:). An idiomatic usage of getopt is to wrap calls to getopt in a while loop, and inside the loop, switch on the return value to process that argument. Within the switch, the predefined global variable optarg will contain the string value passed to that particular argument, which you can use to save each argument value.

You may find it helpful to consult, e.g., this simple example of parsing arguments using getopt, or your caching lab from CSCI 2330, which also used getopt for argument parsing.

Primary Loop

At a high level, your core server functionality should be structured something like the following:

Forever loop:
   Accept new connection from incoming client
   Parse HTTP request
   Ensure well-formed request (return error otherwise)
   Determine if target file exists and is accessible (return error otherwise)
   Transmit contents of file to client (by performing reads on the file and writes on the socket)
   Close the connection (if HTTP/1.0)

You have a choice in how you handle multiple clients within the above loop structure. In particular, recall that we discussed three basic approaches to supporting multiple concurrent client connections:

A multi-threaded approach either spawns a new thread for each incoming connection, or maintains a pool of worker threads to handle whatever clients arrive. That is, once the server accepts a connection, it will hand that connection off to a different thread (either a newly created thread, or a thread in the existing pool) to parse the request, transmit the file, etc. If you decide to use a multi-threaded approach, you should use the pthreads thread library (e.g., pthread_create), as demonstrated in class. Creating a new thread for each new connection is a bit less scalable than using a thread pool but is also simpler to design and perfectly sufficient for this project.
A multi-process approach is similar to using multiple threads, but creates additional processes instead of threads to handle client connections. Using processes avoids a few of the concurrency issues that can arise from multiple threads sharing memory, but also means that coordination between processes is a bit harder. Creating multiple processes also introduces a bit more overhead than creating multiple threads. The best way to communicate between processes is to use the pipe system call; you can see an example usage of pipe together with fork in the manpages (run man 2 pipe).
An event-driven architecture will keep a list of active connections and loop over them, performing a little bit of work on behalf of each connection. For example, there might be a loop that first checks to see if any new connections are pending to the server and then loops over all existing client connections and sends a "block" of file data to each (e.g., 4096 bytes). This event-driven architecture has the primary advantage of avoiding any synchronization issues associated with a multi-threaded model and avoids the performance overhead of context switching among threads or processes. Implementing this approach is likely to require using non-blocking sockets (which we did not specifically discuss in class) and the select system call.

A multi-threaded approach will be generally be the most straightforward option, as coordination among processes is more complicated than coordination among threads. An event-driven approach has the potential to be the most efficient option but is also the most complex.

Translating Filenames

Remember that HTTP requests will specify relative filenames (such as index.html) which are translated by the server into absolute local filenames. For example, if your document root is in ~username/cs3325/proj1/mydocroot, then when a request is received for foo.txt, the file that you should read is actually ~username/cs3325/proj1/mydocroot/foo.txt.

The translated filename may exist and be readable, or it may exist but be unreadable (e.g., due to file permissions), or it may not exist at all. A missing file should result in HTTP error code 404, while an inaccessible file should result in HTTP error code 403. You can test trying to access an inaccessible file by changing file permissions using chmod. For example, chmod a-r foo.txt will leave foo.txt intact but render it unreadable by your server, while chmod a+r foo.txt will make it readable again.

Remember that the default filename (i.e., if just a directory is specified) is index.html. This convention is why, for instance, the two URLs http://www.bowdoin.edu and http://www.bowdoin.edu/index.html return the same page.

HTTP 1.0 and 1.1

When you fetch an HTML web page in a browser (i.e., a file of type text/html), the browser parses the file for embedded links (such as images) and then retrieves those files from the server as well. For example, if a web page contains four images, then a total of five files will be requested from the server. The primary difference between HTTP 1.0 and HTTP 1.1 is how these multiple files are requested.

Using HTTP 1.0, a separate connection is used for each requested file. While simple, this approach is not the most efficient. HTTP 1.1 attempts to address this inefficiency by keeping connections to clients open, allowing for "persistent" connections and pipelining of client requests. That is, after the results of a single request are returned (e.g., index.html), if using HTTP 1.1, your server should leave the connection open for some period of time, allowing the client to reuse that connection to make subsequent requests. One design decision here is determining how long to keep the connection open. This timeout needs to be configured in the server and ideally should be dynamic based on the number of other active connections the server is currently supporting. If the server is idle, it can afford to leave the connection open for a relatively long period of time, but if it is is busy servicing several clients at once, it may not wish to have an idle connection sitting around and consuming thread resources for very long. You should develop a simple heuristic to determine this timeout in your server (but feel free to start with a fixed value at first).

Socket timeouts can be set using setsockopt. Another option for implementing timeouts is the select call. As usual, consult the man pages for details.

Handling Closed Connections

Your server may run into situations in which a client closes one end of the connection and then your server tries to send more data. When this happens, your server process receives a SIGPIPE signal, and the default action upon receipt of this signal is to terminate the process. Since you probably don't want this, an easy solution if you're running into this problem is telling your server to ignore SIGPIPE, as follows:

// ignore SIGPIPE to avoid crashes when using a closed client connection
signal(SIGPIPE, SIG_IGN);

Sending and Receiving Network Data

When you send or receive data over a network socket, what you are really doing is reading or copying data to a lower-level network data buffer in the OS. Since these data buffers are limited in size, you may not be able to read or send all desired data at once. In other words, when receiving data, you have no guarantee of receiving the entire request at once, and when sending data, you have no guarantee of sending the entire response at once. As a result, you may need to call send or recv multiple times in the course of handling a single request.

However, an alternative to using the low-level send and recv functions is to use streams, which you have likely used before in the context of file I/O. Using streams allows you to employ higher-level reading and writing functions like fgets (to read an entire line of data) and fprintf (to write formatted data). To construct a stream from a socket descriptor, just use the fdopen function. You can then use the resulting stream with all of the higher-level I/O functions like fgets and fprintf to receive and send data, rather than the lower-level send and recv calls. Doing so is likely to simplify your string processing code.

Since your server will do a significant amount of string manipulation both when sending and receiving messages, you will want to refamiliarize yourself with C's string processing routines, such as strcat, strncpy, strstr, etc.

Synchronization Issues

Any program involving concurrency (e.g., multiple processes or threads) needs to worry about the issue of synchronization, which refers to ensuring a consistent view of shared data across multiple threads of execution. Remember the general principle that shared data (such as any global variable) should not be modified concurrently by more than one thread to avoid potential data corruption. For example, it is unsafe to have two threads simultaneously incrementing a shared counter. One specific example in this project where you might want such a counter is if you want to track the active number of client connections.

To safely handle a situation like this, you should use a lock (also known as a mutex), which allows ensuring that only a single thread has access to a piece of code. The pthread library includes the pthread_mutex_t type for this situation. For example, if lock is a pthread_mutex_t, then you could safely increment a counter shared across multiple threads as shown below:

pthread_mutex_lock(&lock); // current thread acquires the lock
global_counter++; // safe modification; only one thread can be holding the lock at once
pthread_mutex_unlock(&lock); // release the lock to another thread

Project Writeup

In addition to the code of your server, you must submit a README document (in plain text format) that contains your names and the following sections:

Design Decisions: Explain and justify any significant design decisions that you made in the course of the project. These decisions should not refer to highly specific, code-level details (e.g., splitting up the code into different functions), but rather to higher-level design decisions that likely have little to do with the nuts and bolts of the code itself. In the case of this specific project, you should definitely address (1) your concurrency design for multiple clients, and (2) how you designed connection timeouts for HTTP 1.1. If there were any other parts of the project that required similar kinds of high-level design decisions, include those too. Don't forget to justify why you made the decisions you made!
Testing: Explain in reasonable detail how you went about testing your server. The purpose of this section is to make you thoughtfully consider both how to test as well as whether you have sufficiently tested. For example, if your only testing consists of sending HTTP 1.0 requests over telnet, you should not have very high confidence that your server is fully functional! Most typically, the "hardest" test that you should aim to pass is a sequence of browser requests for pages containing embedded images (which will be requested over HTTP 1.1).
Known Bugs: List any bugs or limitations in functionality that you are aware of. Any information you give here will be helpful to me in fully testing your server and demonstrate that you tested thoroughly yourself. If I come across bugs in my own testing that were not described here, then that will point to a lack of proper testing on your part!

Your writeup should be committed to your repository as a plain text file named README and is due at the same time as your server code.

Logistics and Evaluation

As in Project 0, the link to form a group and initialize your group's project repository on GitHub will be posted to Slack. Once your repository is initialized, clone it to hopper and work there. As a general rule of thumb when working on a group project through GitHub, always pull at the start of a work session and always commit and push at the end of a work session to minimize the chance of a merge conflict. Make sure that your final work (including your writeup) is committed to the repository by the deadline.

To avoid accidentally interfering with the servers of other groups, each group will be assigned a specific (non-standard) port number to use while testing on hopper. Stick to using your assigned port only to avoid conflicting with other groups. However, make sure that you are still able to specify any arbitrary port number via the -p command-line argument. Port assignments will be coordinated over Slack.

Your project will be graded on (1) correctly implementing the server specification, (2) the design and style of your program, and (3) the quality and completeness of your writeup. For guidance on what constitutes good coding design and style, see the Coding Design & Style Guide, which lists many common things to look for. Please ask if you have any other questions about design or style issues. Also don't forget to submit your individual group reports prior to the deadline.

Resources

Here is a list of resources that may be helpful in completing your server:

Linux man pages (e.g., man socket) should be your first stop for details on library functions.
HTTP 1.0 and 1.1 (a good primer on HTTP and primary reference for this project):
http://www.jmarshall.com/easy/http/
HTTP Wikipedia
http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
w3c HTTP page (all the gory details):
http://www.w3.org/Protocols/
Example of argument parsing with getopt:
https://www.tutorialspoint.com/getopt-function-in-c-to-parse-command-line-arguments
A nice summary of useful calls for socket programming, including examples (note: read and write are essentially equivalent to recv and send when applied to sockets):
https://ycpcs.github.io/cs365-spring2017/lectures/lecture15.html
pthreads (thread creation) tutorial:
https://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/pthreads.html
fork (process creation) tutorial:
http://www.csl.mtu.edu/cs4411.ck/www/NOTES/process/fork/create.html
Non-blocking and event-driven socket I/O (somewhat old, but details the main ideas):
http://www.kegel.com/c10k.html#nb

Reminders on External Resources and AI

Remember that the standard CSCI Collaboration Policy applies to this project (the same as to all projects), including the guidelines relating to external resources. For example, it is fine to consult sources like Google or StackOverflow regarding the use of a specific library function. However, you should not conduct web searches or make queries to AI systems like "web server in C" or anything similar. As a rule of thumb, you should not ask AI to write any code for you.