Project 1 - Web Server

Assigned:	Monday, January 31.
Groups Due:	Thursday, February 3, 11:59 pm.
Code Due Date:	Wednesday, February 16, 11:59 pm.
Writeup Due Date:	48 hours after code due.

The goal of this project is to build a functional web server using low-level networking primitives. This assignment will teach you the basics of network programming, client/server architectures, and concurrency issues in high performance servers. In addition to writing your server, you will also write a document explaining its behavior and your major design choices.

This project should be done in teams of two or three. However, remember that the objective of working in a team is to work as a team. In other words, you should not try to approach the project by splitting up the work; instead, all team members are expected to work on all parts of the project.

Server Specification

Your task is to write a simple web server capable of servicing remote clients by sending them requested files from the local machine. Communication between a client and the server is defined by HTTP (the Hypertext Transfer Protocol); your server will need both to understand HTTP requests sent by clients and to respond as defined by HTTP.

Your server must support the core functionality of both the HTTP 1.0 and the HTTP 1.1 standards, with several notable limitations:

You only need to support the HTTP GET method, not others like HEAD, POST, etc.
You only need to support status codes 200, 400, 403, and 404 in your server responses.
You only need to support the Content-Type, Content-Length, and Date headers in your server responses.
You only need to support file types HTML, TXT, JPG, PNG, and GIF.
Your server does not need to support chunked requests, absolute URLs, "100 Continue" responses, or the "If-Modified-Since" and "If-Unmodified-Since" headers.

The only request headers you need to be concerned with to implement the required functionality are "Host" and "Connection", and the only response headers you need to be concerned with are "Date", "Content-Length", and "Content-Type". However, feel free to extend your server to provide any functionality not required by the base specification.
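
As a rough illustration of the three required response headers, here is a minimal sketch of building a response header block. The helper names mime_type_for and format_header, the buffer-based interface, and the fallback type application/octet-stream are assumptions made for this example, not part of the starter code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <time.h>

/* Hypothetical helper: map a filename extension to a MIME type.
 * Only the five file types required by the specification are covered;
 * the fallback type is an assumption of this sketch. */
const char *mime_type_for(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL) return "application/octet-stream";
    if (strcasecmp(dot, ".html") == 0) return "text/html";
    if (strcasecmp(dot, ".txt") == 0)  return "text/plain";
    if (strcasecmp(dot, ".jpg") == 0)  return "image/jpeg";
    if (strcasecmp(dot, ".png") == 0)  return "image/png";
    if (strcasecmp(dot, ".gif") == 0)  return "image/gif";
    return "application/octet-stream";
}

/* Hypothetical helper: write the status line plus the Date, Content-Type,
 * and Content-Length headers into buf, ending with the blank line.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_header(char *buf, size_t size, const char *version, int status,
                  const char *reason, const char *type, long length,
                  time_t now)
{
    char date[64];
    struct tm tm_utc;

    /* HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT" */
    gmtime_r(&now, &tm_utc);
    strftime(date, sizeof date, "%a, %d %b %Y %H:%M:%S GMT", &tm_utc);

    int n = snprintf(buf, size,
                     "%s %d %s\r\n"
                     "Date: %s\r\n"
                     "Content-Type: %s\r\n"
                     "Content-Length: %ld\r\n"
                     "\r\n",
                     version, status, reason, date, type, length);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

A caller would typically pass time(NULL) for now and the file size (e.g., from stat) for length, then send the file contents right after the header.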

Your server program must be written in C on Linux and must accept the following two command-line arguments:

-p [num] to set the port on which the server listens.
-r [path] to set the directory out of which all files are served, called the document root.

For example, you could start the server on port 8888 using the document root serverfiles like the following:

./server -p 8888 -r serverfiles

Command-line options may appear in arbitrary order; therefore, you should use getopt for parsing arguments. Also note that unless your document root starts with a /, it is a relative path, and therefore is interpreted relative to the current working directory. If either command-line option is omitted, the program should exit with an error message.
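
The argument handling described above could look roughly like the following sketch. The struct server_config type and the parse_args function are hypothetical names chosen for this example; only getopt itself and the -p and -r options come from the specification.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical container for the two required options. */
struct server_config {
    int port;
    const char *docroot;
};

/* Parse -p [num] and -r [path] in any order.
 * Returns 0 on success, -1 if an option is missing or malformed. */
int parse_args(int argc, char *argv[], struct server_config *cfg)
{
    int opt;

    cfg->port = -1;
    cfg->docroot = NULL;
    optind = 1;  /* restart scanning; only matters if called more than once */

    while ((opt = getopt(argc, argv, "p:r:")) != -1) {
        switch (opt) {
        case 'p': {
            char *end;
            long port = strtol(optarg, &end, 10);
            if (*end != '\0' || port < 1 || port > 65535)
                return -1;          /* not a valid port number */
            cfg->port = (int)port;
            break;
        }
        case 'r':
            cfg->docroot = optarg;  /* points into argv; no copy needed */
            break;
        default:
            return -1;              /* unknown option or missing argument */
        }
    }

    /* Both options are required. */
    if (cfg->port == -1 || cfg->docroot == NULL)
        return -1;
    return 0;
}
```

In main, a return value of -1 would be the cue to print a usage message to stderr and exit with a nonzero status.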

As with most popular web servers, requests for a directory (e.g., GET / or GET /catpictures/) should default to fetching index.html (i.e., index.html is the default filename if no explicit filename is provided).

A good tutorial on the essentials of the HTTP protocol is linked from the resources at the bottom of this writeup.

Starter Files

Your Git repository includes the following provided starter files:

server.c: the source file in which you should write your server.
Makefile: a Makefile that will let you compile the server by typing make.
testroot: a directory containing a simple web site that you can use as a document root during testing.

The only file you must modify is server.c. You are welcome to modify the test document root or create other document roots to use during testing. Note that the provided Makefile is configured to compile with all errors turned on and warnings counted as errors; this is done intentionally to ensure that you fix compiler warnings rather than ignore them. Fixing warnings will often teach you something about programming even if the warning in question doesn't represent a bug in the program.

Testing the Server

There are several ways you can test your server. The first is to simply access your server in a browser - if your server is running on port 8888, then you can type turing.bowdoin.edu:8888/index.html into your web browser to access index.html on the server. However, using your browser is not recommended during early development and testing, as browsers will often simply hang if your server isn't responding correctly. A more effective initial testing approach is to use telnet, which is a tool for sending arbitrarily-formatted text messages to any network server. For example, below is an example of connecting to google.com on port 80 and then sending an HTTP request for the file index.html:

$ telnet google.com 80
GET /index.html HTTP/1.0

Note that in the above command, you must press Enter twice after the "GET" line (i.e., the request line must be followed by a blank line) in order to complete the request. The response to this request will be the HTTP-formatted response from the server. Telnet is a more reliable starting point than a browser, as you will be able to verify that you are getting any response back at all without also having to worry about whether the response is compliant with HTTP.

As an intermediate step between telnet and a full-blown browser, you can also use the wget or curl utilities. These utilities provide command-line HTTP clients: wget will send HTTP/1.0 requests, while curl will send HTTP/1.1 requests (though it can be configured to send HTTP/1.0 requests as well, e.g., with the --http1.0 option). Consult the man pages for details on proper usage.

A recommended testing strategy is to use telnet initially, then move to wget and/or curl, then finally move to a full-blown browser once things seem to be working. The provided sample document root will be useful in testing that HTTP 1.1 is working properly, as the pages with embedded images will be requested through a single connection when accessed via a browser.

IMPORTANT: Do not leave your server running when you are not actively testing! Whenever you are done testing, make sure to terminate your server (Control-C), especially before logging off the server. Leaving a server running for long periods will occupy port numbers and is a potential security risk.

Implementation Advice

This section contains tips on implementing various parts of the server.

Primary Loop

At a high level, your web server will be structured something like the following:

Forever loop:
   Accept new connection from incoming client
   Parse HTTP request
   Ensure well-formed request (return error otherwise)
   Determine if target file exists and is accessible (return error otherwise)
   Transmit contents of file to client (by performing reads on the file and writes on the socket)
   Close the connection (if HTTP/1.0)
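
In C, the skeleton of this loop might look like the following minimal sketch. The names open_listener and serve_forever, the backlog of 16, and the callback-style handler are assumptions for illustration; this version handles one client at a time and always closes the connection, as HTTP/1.0 would.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket listening on the given port.
 * Returns the listening descriptor, or -1 on failure. */
int open_listener(int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int yes = 1;  /* allow quick restarts on the same port while testing */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons((unsigned short)port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* The "forever loop": hand each accepted connection to a caller-supplied
 * handler, which would parse the request and transmit the file. */
void serve_forever(int listen_fd, void (*handle)(int client_fd))
{
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0) {
            perror("accept");
            continue;
        }
        handle(client_fd);
        close(client_fd);  /* HTTP/1.0 behavior: one request per connection */
    }
}
```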

You have a choice in how you handle multiple clients within the above loop structure. In particular, recall that we discussed three basic approaches to supporting multiple concurrent client connections:

A multi-threaded approach will spawn a new thread for each incoming connection. That is, once the server accepts a connection, it will spawn a thread to parse the request, transmit the file, etc. If you decide to use a multi-threaded approach, you should use the pthreads thread library (i.e., pthread_create).
A multi-process approach maintains a pool of worker processes to which the main server hands off requests. The main appeal of this approach is portability (it does not assume that a particular threads package is available across hardware/software platforms). It does face increased context-switch overhead relative to a multi-threaded approach. Creating a new process for every request can also work but is not ideal, as it wastes a significant amount of resources. A better approach is to use pipe to let your processes communicate (and thereby avoid creating a new process every time).
An event-driven architecture will keep a list of active connections and loop over them, performing a little bit of work on behalf of each connection. For example, there might be a loop that first checks to see if any new connections are pending to the server (performing appropriate bookkeeping if so), and then it will loop over all existing client connections and send a "block" of file data to each (e.g., 4096 bytes, or 8192 bytes, matching the granularity of disk block size). This event-driven architecture has the primary advantage of avoiding any synchronization issues associated with a multi-threaded model (though synchronization effects should be limited in your simple web server) and avoids the performance overhead of context switching among a number of threads. To implement this approach, you may need to use non-blocking sockets. The select system call may also be quite useful.

A multi-threaded approach will generally be the most straightforward option, as coordination among processes is more complicated than coordination among threads. An event-driven approach is the most efficient option but also the most complex, so you may wish to consider it if you're looking for an extra challenge.
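
For the multi-threaded option, one possible shape for the hand-off from the accept loop to a new thread is sketched below. The names connection_thread and spawn_connection_thread are hypothetical, and the canned 404 response is only a placeholder for real request handling.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Thread entry point. The argument is a heap-allocated int holding the
 * client socket, so each thread owns a private copy of its descriptor. */
static void *connection_thread(void *arg)
{
    int client_fd = *(int *)arg;
    free(arg);

    /* Placeholder for real request handling (parse, validate, transmit).
     * A real handler would also loop in case send is only partial. */
    const char *msg = "HTTP/1.0 404 Not Found\r\nContent-Length: 0\r\n\r\n";
    send(client_fd, msg, strlen(msg), 0);

    close(client_fd);
    return NULL;
}

/* Hypothetical helper: start a detached thread for one accepted connection.
 * Returns 0 on success, or -1 on failure (client_fd is closed in that case). */
int spawn_connection_thread(int client_fd)
{
    int *arg = malloc(sizeof *arg);
    if (arg == NULL) {
        close(client_fd);
        return -1;
    }
    *arg = client_fd;

    pthread_t tid;
    if (pthread_create(&tid, NULL, connection_thread, arg) != 0) {
        free(arg);
        close(client_fd);
        return -1;
    }
    pthread_detach(tid);  /* no join needed; resources are freed on exit */
    return 0;
}
```

Passing the descriptor through the heap avoids a common bug in which every thread receives a pointer to the same local variable, which the accept loop then overwrites on its next iteration.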

Translating Filenames

Remember that HTTP requests will specify relative filenames (such as index.html) which are translated by the server into absolute local filenames. For example, if your document root is in ~username/cs3325/proj1/mydocroot, then when a request is received for foo.txt, the file that you should read is actually ~username/cs3325/proj1/mydocroot/foo.txt.

The translated filename may exist and be readable, or it may exist but be unreadable (e.g., due to file permissions), or it may not exist at all. A missing file should result in HTTP error code 404, while an inaccessible file should result in HTTP error code 403.
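
The translation and the three outcomes above can be sketched as follows. The function names build_local_path and status_for_path are hypothetical, and treating a non-regular file (such as a directory requested without a trailing slash) as 404 is an assumption of this sketch rather than a requirement. The sketch also does not guard against ".." components in the request path.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: join the document root and the request path into
 * out, appending index.html when the request ends in '/'.
 * Returns 0 on success, or -1 if the result does not fit in out. */
int build_local_path(const char *docroot, const char *request_path,
                     char *out, size_t size)
{
    size_t len = strlen(request_path);
    const char *suffix =
        (len > 0 && request_path[len - 1] == '/') ? "index.html" : "";

    int n = snprintf(out, size, "%s%s%s", docroot, request_path, suffix);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: decide which status code a local path deserves.
 * Returns 200 (readable file), 403 (exists but unreadable), or 404. */
int status_for_path(const char *local_path)
{
    struct stat st;

    if (stat(local_path, &st) != 0)
        return (errno == EACCES) ? 403 : 404;  /* EACCES: blocked directory */
    if (!S_ISREG(st.st_mode))
        return 404;                            /* assumption: not a plain file */
    if (access(local_path, R_OK) != 0)
        return 403;                            /* exists but not readable */
    return 200;
}
```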

Remember that the default filename (i.e., if just a directory is specified) is index.html. This is why the two URLs http://www.bowdoin.edu and http://www.bowdoin.edu/index.html return the same page. Also note that some pages, such as Bowdoin's home page above, actually redirect to a different (i.e., the real) home page. This redirection normally happens automatically in a browser, so you don't even realize it's happening, but if testing with telnet, you may see a very short page simply instructing the browser to request a different file instead.

HTTP 1.0 and 1.1

When you fetch an HTML web page in a browser (i.e., a file of type text/html), the browser parses the file for embedded links (such as images) and then retrieves those files from the server as well. For example, if a web page contains 4 images, then a total of 5 files will be requested from the server. The primary difference between HTTP 1.0 and HTTP 1.1 is how these multiple files are requested.

Using HTTP 1.0, a separate connection is used for each requested file. While simple, this approach is not the most efficient. HTTP 1.1 attempts to address this limitation by keeping connections to clients open, allowing for "persistent" connections and pipelining of client requests. That is, after the results of a single request are returned (e.g., index.html), if using HTTP 1.1, your server should leave the connection open for some period of time, allowing the client to reuse that connection to make subsequent requests. One key issue here is determining how long to keep the connection open. This timeout needs to be configured in the server and ideally should be dynamic based on the number of other active connections the server is currently supporting. Thus if the server is idle, it can afford to leave the connection open for a relatively long period of time. If the server is busy servicing several clients at once, it may not be able to afford to have an idle connection sitting around (consuming kernel/thread resources) for very long. You should develop a simple heuristic to determine this timeout in your server (but feel free to start with a fixed value at first).

Socket timeouts can be set using setsockopt. Another option for implementing timeouts is the select call.
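
As one possible starting point, the sketch below pairs a very simple load-based heuristic with a receive timeout set through setsockopt. The function names and the constants (a 30-second base and a 2-second floor) are illustrative assumptions, not recommended values.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical heuristic: divide a base timeout by the number of active
 * connections, but never go below a floor. Constants are illustrative. */
int idle_timeout_seconds(int active_connections)
{
    const int base = 30;
    const int floor_s = 2;

    if (active_connections < 1)
        active_connections = 1;

    int t = base / active_connections;
    return t < floor_s ? floor_s : t;
}

/* Apply a receive timeout to a socket. Afterwards, a recv call that waits
 * longer than the timeout fails with errno set to EAGAIN or EWOULDBLOCK.
 * Returns 0 on success, or -1 on failure (as setsockopt does). */
int set_recv_timeout(int fd, int seconds)
{
    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}
```

With this in place, a connection handler could call set_recv_timeout(client_fd, idle_timeout_seconds(active)) before waiting for the next request and close the connection when recv times out.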

Working with Strings

Since a significant part of this assignment involves working with strings, you will want to refamiliarize yourself with C's string processing routines, such as strcat, strncpy, strstr, etc. Also remember that pointer arithmetic can often result in cleaner code (e.g., by maintaining pointers that you increment rather than numeric indices that you increment).
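
As a small example of this kind of string handling, the sketch below parses a request line and detects the end of the request headers. The struct request_line type and both function names are hypothetical, and rejecting every method other than GET is an assumption based on the limited scope of this project.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three parts of a request line. */
struct request_line {
    char method[8];
    char path[1024];
    char version[16];
};

/* Parse a line of the form "GET /path HTTP/1.x".
 * Returns 0 on success, or -1 if the line is malformed or unsupported
 * (in which case the caller would likely answer with status 400). */
int parse_request_line(const char *line, struct request_line *out)
{
    /* Field widths are one less than the buffer sizes, so sscanf cannot
     * overflow. %s stops at whitespace, which also drops a trailing CRLF. */
    if (sscanf(line, "%7s %1023s %15s",
               out->method, out->path, out->version) != 3)
        return -1;
    if (strcmp(out->method, "GET") != 0)
        return -1;
    if (out->path[0] != '/')
        return -1;
    if (strcmp(out->version, "HTTP/1.0") != 0 &&
        strcmp(out->version, "HTTP/1.1") != 0)
        return -1;
    return 0;
}

/* Returns nonzero once buf contains the blank line that ends the headers. */
int request_complete(const char *buf)
{
    return strstr(buf, "\r\n\r\n") != NULL;
}
```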

Sending and Receiving Network Data

Remember when sending or receiving data over a network socket that what you are really doing is copying data to or reading data from a lower-level network data buffer. Since these data buffers are limited in size, you may not be able to read or send all desired data at once. In other words, when receiving data, you have no guarantee of receiving the entire request at once, and when sending data, you have no guarantee of sending the entire response at once. As a result, you may need to call send or recv multiple times in the course of handling a single request.
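
On the sending side, the usual pattern is a loop that keeps calling send until everything has been accepted. The helper name send_all below is hypothetical; it is a minimal sketch that retries on EINTR and gives up on any other error.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling send until all len bytes have been handed to the kernel.
 * Returns 0 on success, or -1 on error. */
int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal; just retry */
            return -1;
        }
        sent += (size_t)n;     /* send may accept fewer bytes than asked */
    }
    return 0;
}
```

The receiving side needs a similar loop that appends each recv result to a buffer until the blank line ending the request headers has arrived.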

A handy trick that may make reading and sending data over the socket easier is to use the fdopen function, which wraps a socket descriptor in a standard I/O stream (a FILE *). You can then work with the socket using the regular stream I/O functions (e.g., fgets to read a line of text, fprintf to send output, etc). Keep in mind that these streams are buffered, so call fflush after writing to make sure the output is actually sent. These functions are higher level than the send and recv calls and may be easier to work with.
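
A minimal sketch of the fdopen approach follows. The struct sock_streams type and both function names are hypothetical. Using two separate streams (one for reading, one for writing on a duplicated descriptor) is a design choice of this sketch that avoids mixing reads and writes on a single buffered stream.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical pair of streams wrapped around one connected socket. */
struct sock_streams {
    FILE *in;   /* for fgets and friends */
    FILE *out;  /* for fprintf and friends; remember to fflush */
};

/* Wrap client_fd in a read stream and a write stream.
 * Returns 0 on success; on failure returns -1 with everything closed. */
int open_socket_streams(int client_fd, struct sock_streams *s)
{
    int out_fd = dup(client_fd);
    if (out_fd < 0) {
        close(client_fd);
        return -1;
    }

    s->in = fdopen(client_fd, "r");
    s->out = fdopen(out_fd, "w");
    if (s->in == NULL || s->out == NULL) {
        /* fclose also closes the underlying descriptor */
        if (s->in != NULL) fclose(s->in); else close(client_fd);
        if (s->out != NULL) fclose(s->out); else close(out_fd);
        return -1;
    }
    return 0;
}

/* Read one line and strip the trailing CRLF (or LF).
 * Returns 0 on success, or -1 on end of file or error. */
int read_http_line(FILE *in, char *line, size_t size)
{
    if (fgets(line, (int)size, in) == NULL)
        return -1;
    line[strcspn(line, "\r\n")] = '\0';
    return 0;
}
```

When the connection is finished, calling fclose on both streams closes both descriptors, so no separate close call is needed.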

Synchronization Issues

Any program involving concurrency (e.g., multiple processes or threads) needs to worry about the issue of synchronization, which refers to ensuring a consistent view of shared data across multiple threads of execution. Remember the general principle that shared data (such as a global variable) should not be modified concurrently by more than one thread to avoid potential data corruption. For example, it is unsafe to have two threads simultaneously incrementing a shared counter. One specific example in this project where you might want such a counter is if you want to track the active number of client connections.

To safely handle a situation like this, you should use a lock (also known as a mutex), which ensures that only one thread at a time can execute a protected piece of code. The pthread library includes the pthread_mutex_t type for this situation. For example, if lock is a pthread_mutex_t, then you could safely increment some counter across multiple threads as shown below:

pthread_mutex_lock(&lock); // current thread acquires the lock
global_counter++; // safe; only one thread can hold the lock at a time
pthread_mutex_unlock(&lock); // release the lock so another thread can acquire it
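
For a more complete picture, the sketch below puts the same three lines into a small self-contained demo in which several threads increment the counter concurrently. The function names increment_many and run_counter_demo and the limit of 16 threads are assumptions made for this example.

```c
#include <pthread.h>

/* A statically initialized mutex needs no call to pthread_mutex_init. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long global_counter = 0;

/* Thread body: increment the shared counter *arg times under the lock. */
static void *increment_many(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&lock);
        global_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Hypothetical demo: run nthreads threads (at most 16), each incrementing
 * the counter per_thread times. Returns the final counter value, or -1 if
 * the arguments are invalid or a thread could not be created. */
long run_counter_demo(int nthreads, long per_thread)
{
    pthread_t tids[16];
    int started = 0;

    if (nthreads < 1 || nthreads > 16)
        return -1;
    global_counter = 0;

    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL,
                           increment_many, &per_thread) != 0)
            break;
    }
    /* Wait for every thread that did start before reading the counter. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return started == nthreads ? global_counter : -1;
}
```

Without the lock and unlock calls, the final value could come out lower than nthreads times per_thread, since concurrent increments can overwrite one another.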

Project Writeup

In addition to your program itself, you will also write a short paper (2-4 pages) that describes your server. A typical format for a systems-style paper such as this is something like the following:

an introductory section that highlights the purpose of the project
a design section that describes your major design choices and key data structures that you used (if a figure makes your explanation more clear, use one!)
an implementation section that overviews the structure of your code (at a reasonably high level - this should supplement rather than duplicate your code)
an evaluation section that describes how you tested your server
a conclusion that summarizes your project and reflects on the assignment in general

While you do not need to rigidly adhere to this structure, it is a good basic framework to follow. The most common feedback that students receive on writeups like this concerns excessive emphasis on fine-grained coding details (e.g., listing all function names in the program, discussing the details of specific variables, etc). These sorts of details are usually better conveyed by your code itself, and they are also less important and less interesting than the higher-level design choices that you made. A good example of such a design choice is your concurrency design for multiple clients; a decision like this has little to do with the actual code (even though the design is ultimately realized in code). You will also get a sense of how to best structure these writeups as you read papers in this course, which will often go into detail on the design of the system while saying little (if anything) about code or even the programming language used. Some code-related details are appropriate to include in an implementation section, but don't go overboard on this section (and spare yourself the work of doing so!).

Your writeup should also clearly state anything that does not work correctly and any major problems that you encountered.

Add your writeup to your Git repository as a PDF named writeup.pdf.

Logistics and Evaluation

To get started, go to Blackboard and browse to this course. Click the Start Projects menu item on the left, then click the "Begin Project 1 - Web Server" link, which will take you to GitHub and walk you through initializing your private project repository. When this is done, you will be looking at the GitHub page for your new repository. You can then clone this repository to turing and begin to work.

Your program will be graded on (1) correctly implementing the server specification, (2) the design and style of your program, and (3) the quality of your writeup. For guidance on what constitutes good design and style, see the Coding Design & Style Guide, which lists many common things to look for. Please ask if you have any other questions about design or style issues.

Resources

Here is a list of resources that may be helpful in completing your server. Linux man pages will also be useful.

HTTP 1.0 and 1.1 (a good primer on HTTP and primary reference for this project):
http://www.jmarshall.com/easy/http/
HTTP on Wikipedia:
http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
W3C HTTP page:
http://www.w3.org/Protocols/
Parsing command-line arguments with getopt:
https://www.gnu.org/software/libc/manual/html_node/Example-of-Getopt.html
A nice summary of useful calls for socket programming, including examples (note: read and write are essentially equivalent to recv and send when applied to sockets):
https://ycpcs.github.io/cs365-spring2017/lectures/lecture15.html
pthreads (thread creation) tutorial:
https://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/pthreads.html
fork (process creation) tutorial:
http://www.csl.mtu.edu/cs4411.ck/www/NOTES/process/fork/create.html
Nonblocking and event-driven socket I/O (somewhat old, but details the main ideas):
http://www.kegel.com/c10k.html#nb
Other resources: Remember that the standard CSCI Collaboration Policy applies to this project (the same as to all projects), including the guidelines relating to external resources. For example, it is fine to consult sources like Google or StackOverflow regarding the use of a specific library function. However, you should not conduct web searches for "web server in C" or anything similar (such things certainly exist, but remember that we are also able to find them).