CS 340 - Spring 2005

CS 340 Spring 2005: Project 3

In this project you will build on your previous projects to (1) check the scalability of your shortest path implementation to large datasets, and (2) add support for range searching queries.

Part 1: Scalability to large datasets

Algorithms are normally designed and analyzed in the RAM model of computation, which assumes that all the input data fits in main memory and all memory accesses cost the same. When the data is large, it does not fit in memory and the virtual memory system places it on disk. When the program runs, data that needs to be accessed is paged into main memory. When the memory is full, the pages that have been accessed least recently are swapped out to disk. Thus, the main memory functions as a cache for the disk. All of this happens invisibly to the programmer, through a part of the operating system called the virtual memory manager (VMM).

The good part is that when we program we do not need to worry that the data will not fit into memory; the VMM will take care of this for us.

The bad part is that disks are much slower than main memory and the CPU (typically 3 orders of magnitude or more). When working with large inputs, moving the data between main memory and disk, rather than the internal computation time, is typically the bottleneck. Relying on the virtual memory system may cause a big performance penalty, and the actual running time of the program may end up very different from the estimated one. That is to say, the RAM complexity of an algorithm, estimated as the number of instructions executed, may not be a good measure of the running time. In this case we also need to analyze the I/O-complexity of the algorithm, that is, the number of blocks moved between main memory and disk during the execution of the program.
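To get a feel for why the I/O-complexity can differ so much from the instruction count, here is a toy simulation in Python. It is only a sketch under simplifying assumptions: memory holds a fixed number of pages, eviction is strictly least-recently-used, and only page-ins are counted (write-backs are ignored). The function name count_page_faults and its parameters are made up for this illustration; a real VMM is more complicated.

```python
from collections import OrderedDict


def count_page_faults(accesses, items_per_page, pages_in_memory):
    """Count page-ins for a sequence of item indices under LRU replacement."""
    resident = OrderedDict()  # page number -> None, least recently used first
    faults = 0
    for item in accesses:
        page = item // items_per_page
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            if len(resident) >= pages_in_memory:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = None
    return faults


if __name__ == "__main__":
    num_items, items_per_page, pages_in_memory = 1000, 100, 2
    sequential = list(range(num_items))
    # Same 1000 items, but consecutive accesses always land on different pages.
    strided = [(i % 10) * items_per_page + i // 10 for i in range(num_items)]
    print("sequential faults:", count_page_faults(sequential, items_per_page, pages_in_memory))
    print("strided faults:", count_page_faults(strided, items_per_page, pages_in_memory))
```

Both access orders touch every item exactly once, so they execute the same number of "instructions" in the RAM model, yet in this toy setting the sequential scan pages in each of the 10 pages once while the strided order faults on every access.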

Algorithms that optimize the I/O in addition to the CPU performance are called I/O-efficient algorithms. I/O-efficient algorithms have been an active area of research, as the amount of data that is available and needs to be processed has increased dramatically over the last 10 years.

In the first part of your third project you will investigate the performance of your shortest path algorithm as the size of the input increases. To do this you will draw a graph with the running time (in seconds) on the y-axis, and the size (number of nodes + number of edges) of the dataset on the x-axis.

Pick at least 4 different datasets, of increasing sizes. The first one should be a single state, the second one 10 states, the third one 25 states, and the last one the entire US.
Augment your program so that it prints out the number of nodes and edges in the adjacency list. Add timers around the two main steps of the program: (1) creating the adjacency list of the input, and (2) running the shortest path algorithm between the selected points (do not include the time spent selecting the source and destination). Print out the times.
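The instrumentation described above might look roughly like the following Python sketch. Everything in it is illustrative: the helper names (build_adjacency_list, count_nodes_and_edges, shortest_path_length, timed) and the toy edge list are assumptions, and your own project will have its own input format and shortest path code. The shortest path routine here is a plain Dijkstra with a binary heap, used only so the example runs.

```python
import heapq
import time
from collections import defaultdict


def build_adjacency_list(edges):
    """Build an undirected adjacency list from (u, v, weight) triples."""
    adjacency = defaultdict(list)
    for u, v, weight in edges:
        adjacency[u].append((v, weight))
        adjacency[v].append((u, weight))
    return adjacency


def count_nodes_and_edges(adjacency):
    """Each undirected edge is stored twice, once per endpoint."""
    num_nodes = len(adjacency)
    num_edges = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return num_nodes, num_edges


def shortest_path_length(adjacency, source, target):
    """Dijkstra with a binary heap; returns None if target is unreachable."""
    best = {source: 0.0}
    frontier = [(0.0, source)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in adjacency.get(node, ()):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(frontier, (candidate, neighbor))
    return None


def timed(function, *args):
    """Run function(*args); return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    toy_edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0)]
    adjacency, build_seconds = timed(build_adjacency_list, toy_edges)
    num_nodes, num_edges = count_nodes_and_edges(adjacency)
    length, search_seconds = timed(shortest_path_length, adjacency, "a", "c")
    print(f"nodes={num_nodes} edges={num_edges}")
    print(f"build: {build_seconds:.6f} s, shortest path: {search_seconds:.6f} s")
```

The sketch measures wall-clock time on purpose: time spent waiting for pages to come in from disk shows up in wall-clock time, while a CPU-time clock would hide it.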

What to hand in: a paper containing the graph and a brief paragraph explaining it (why the running time scales well if it does, or why it does not if it does not).

Part 2: Range searching

Add support for efficient range searching queries. That is, allow the user to click on the lower-left point and upper-right point of a query rectangle; display the rectangle, and find and display all segments which overlap it.

To make this efficient you will implement an index structure. Since the roads are (kind of) uniformly distributed, I suggest you use a fixed-grid structure. However, if you want to use a (bucket) quadtree or kd-tree, feel free! You will, of course, get public recognition if you choose to do so. Some things you need to think about:

How to choose the grid size, so that each grid cell has no more than one page (8KB) of data?
How to store the data in each grid cell?
How to deal with segments (rather than points)? For instance, how to handle segments that overlap cells?
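One possible set of answers to these questions is sketched below in Python; it is meant as a starting point for your thinking, not as the required design. The assumptions are mine: each segment is an (x1, y1, x2, y2) tuple taking 32 bytes (four 8-byte coordinates), the roads are uniform enough that an average of one page per cell is a reasonable target, and a segment is stored (by index) in every cell that its bounding box overlaps. The names choose_cells_per_side and FixedGrid are made up for the example.

```python
import math
from collections import defaultdict


def choose_cells_per_side(num_segments, page_bytes=8 * 1024, segment_bytes=32):
    """Pick k so a k-by-k grid averages at most one page of segments per cell."""
    segments_per_page = max(1, page_bytes // segment_bytes)
    cells_needed = max(1, math.ceil(num_segments / segments_per_page))
    return math.ceil(math.sqrt(cells_needed))


class FixedGrid:
    """Uniform k-by-k grid; a cell lists segments whose bounding box overlaps it."""

    def __init__(self, segments, bounds, cells_per_side):
        self.min_x, self.min_y, max_x, max_y = bounds
        self.k = cells_per_side
        # Fall back to 1.0 so a degenerate (zero-width) map does not divide by zero.
        self.cell_width = (max_x - self.min_x) / self.k or 1.0
        self.cell_height = (max_y - self.min_y) / self.k or 1.0
        self.cells = defaultdict(list)
        for index, (x1, y1, x2, y2) in enumerate(segments):
            box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
            for cell in self._cells_overlapping(*box):
                self.cells[cell].append(index)

    def _clamp(self, cell_index):
        return min(max(cell_index, 0), self.k - 1)

    def _cells_overlapping(self, low_x, low_y, high_x, high_y):
        """Yield (column, row) for every cell the given box overlaps."""
        first_column = self._clamp(math.floor((low_x - self.min_x) / self.cell_width))
        last_column = self._clamp(math.floor((high_x - self.min_x) / self.cell_width))
        first_row = self._clamp(math.floor((low_y - self.min_y) / self.cell_height))
        last_row = self._clamp(math.floor((high_y - self.min_y) / self.cell_height))
        for column in range(first_column, last_column + 1):
            for row in range(first_row, last_row + 1):
                yield column, row

    def candidates(self, low_x, low_y, high_x, high_y):
        """Indices of segments that might overlap the query rectangle."""
        found = set()  # a set removes segments reported by several cells
        for cell in self._cells_overlapping(low_x, low_y, high_x, high_y):
            found.update(self.cells.get(cell, ()))
        return sorted(found)


if __name__ == "__main__":
    segments = [(0, 0, 1, 1), (9, 9, 10, 10), (0, 0, 10, 10)]
    grid = FixedGrid(segments, bounds=(0, 0, 10, 10), cells_per_side=2)
    print(grid.candidates(0, 0, 2, 2))
```

Note that candidates returns a superset of the answer: a segment can share a cell with the query rectangle without overlapping it, so each candidate still has to pass an exact segment-rectangle test. Storing a long segment in several cells also means a cell can hold more than the average, which is one of the trade-offs to think about when choosing the grid size.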

Add a timer around the range searching part (include only the computation time, not the rendering part) and print out the time after each query.

To get this to work, remember, think before you start (programming)!! If you spend even just a few hours thinking, it can save you hours of frustration! Have the entire layout of your program clear in your mind, and program it incrementally. Use short functions, and suggestive variable names.

Start by implementing range searching as a straightforward linear scan. Then refine it by building the grid. And so on.
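As a reference point for that first step, here is a minimal Python sketch of the linear scan. It assumes segments are (x1, y1, x2, y2) tuples and the query rectangle is (low_x, low_y, high_x, high_y); both function names are made up for the example. The overlap test uses Liang-Barsky style parametric clipping, which is one standard way to decide whether a segment touches an axis-aligned rectangle; you are free to use a different test.

```python
import time


def segment_overlaps_rectangle(segment, rectangle):
    """True if some point of the segment lies inside the closed rectangle."""
    x1, y1, x2, y2 = segment
    low_x, low_y, high_x, high_y = rectangle
    dx, dy = x2 - x1, y2 - y1
    t_enter, t_exit = 0.0, 1.0  # parameter range of the segment still inside
    # Each pair (p, q) encodes one boundary constraint p * t <= q.
    constraints = ((-dx, x1 - low_x), (dx, high_x - x1),
                   (-dy, y1 - low_y), (dy, high_y - y1))
    for p, q in constraints:
        if p == 0:
            if q < 0:
                return False  # parallel to this boundary and outside it
            continue
        t = q / p
        if p < 0:
            t_enter = max(t_enter, t)
        else:
            t_exit = min(t_exit, t)
        if t_enter > t_exit:
            return False
    return True


def range_query_linear(segments, rectangle):
    """Check every segment; return the indices of those overlapping the rectangle."""
    return [index for index, segment in enumerate(segments)
            if segment_overlaps_rectangle(segment, rectangle)]


if __name__ == "__main__":
    segments = [(0, 0, 10, 10), (0, 0, 1, 1), (0, 5, 10, 5), (0, 13, 13, 0)]
    query = (4, 4, 6, 6)
    start = time.perf_counter()
    hits = range_query_linear(segments, query)
    elapsed = time.perf_counter() - start  # computation only, no rendering
    print(f"segments overlapping {query}: {hits} ({elapsed:.6f} s)")
```

Once this works, the same overlap test can be reused unchanged to filter the candidates returned by whatever index structure you build, and the linear scan gives you a baseline to compare query times against.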

As you program, remember to follow the programming style we discussed (no globals, no function longer than a screen, etc.). I will not help you debug unless you follow these guidelines!