In this project you will build on your previous projects to (1) check the scalability of your shortest path implementation to large datasets, and (2) add support for range searching queries.
The good part is that when we program we do not need to worry that the data will not fit into memory; the virtual memory manager (VMM) will take care of this for us.
The bad part is that disks are much slower than main memory and the CPU (typically by 3 orders of magnitude or more). When working with large inputs, the bottleneck is typically moving the data between main memory and disk, rather than the internal computation time. Relying on the virtual memory system may cause a big performance penalty, and may completely change the estimated running time of the program. That is to say, the RAM complexity of an algorithm, estimated as the number of instructions executed, may not be a good measure of the running time. In this case we also need to analyze the I/O-complexity of the algorithm, that is, the number of blocks moved between main memory and disk during the execution of the program.
Algorithms that optimize the I/O in addition to the CPU performance are called I/O-efficient algorithms. I/O-efficient algorithms have been an active area of research, as the size of the data that is available and needs to be processed has increased dramatically over the last 10 years.
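To get a feel for why counting instructions can be misleading, here is a back-of-the-envelope sketch in Python. It compares the number of block transfers for reading the data sequentially with the worst case where every access lands in a block that is not in memory. The function names, the input size and the block size are all made up for illustration; they are not taken from your datasets.

```python
def scan_ios(n_items, items_per_block):
    """Block transfers needed to read n_items sequentially (rounded up)."""
    return (n_items + items_per_block - 1) // items_per_block


def worst_case_random_ios(n_accesses):
    """Worst case: every access touches a block that is not in memory."""
    return n_accesses


# Hypothetical numbers, chosen only for illustration.
n_edges = 10_000_000
edges_per_block = 1_000

sequential = scan_ios(n_edges, edges_per_block)
scattered = worst_case_random_ios(n_edges)
print("sequential scan:", sequential, "block transfers")
print("scattered access:", scattered, "block transfers (worst case)")
```

Both access patterns touch the same number of items, so their RAM complexity is the same, but the number of block transfers differs by a factor equal to the block size.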
In the first part of your third project you will investigate the performance of your shortest path algorithm as the size of the input increases. To do this you will draw a graph with the running time (in seconds) on the y-axis, and the size (number of nodes + number of edges) of the dataset on the x-axis.
Pick at least 4 different datasets, of increasing sizes. The first one should be a single state, the second one 10 states, the third one 25 states, and the last one the entire US.
Augment your program so that it prints out the number of nodes and edges in the adjacency list. Add timers around the two main steps of the program: (1) creating the adjacency list of the input, and (2) running the shortest path algorithm between the selected points (do not include the time spent selecting the source and destination). Print out the times.
What to hand in: a paper containing the graph and a brief paragraph explaining it (why the running time scales well, if it does, or why it does not).
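The timing setup might look roughly like the sketch below. It is written in Python only for illustration, and it makes several assumptions that may not match your program: edges are undirected (u, v, weight) triples, node names are comparable values such as integers or strings, and the shortest path step is Dijkstra with a binary heap. All function names are hypothetical. The point is where the two timers go, not the algorithm itself.

```python
import heapq
import time


def build_adjacency_list(edges):
    """Build {node: [(neighbor, weight), ...]} from undirected (u, v, w) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def shortest_path_length(adj, source, target):
    """Dijkstra with a binary heap; returns None if target is unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_experiment(edges, source, target):
    """Time the two main steps separately; source/target are chosen beforehand."""
    adj, build_seconds = timed(build_adjacency_list, edges)
    length, search_seconds = timed(shortest_path_length, adj, source, target)
    return {
        "nodes": len(adj),
        "edges": len(edges),
        "build_seconds": build_seconds,
        "search_seconds": search_seconds,
        "length": length,
    }


# Tiny made-up input, just to show the output format.
toy_edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0)]
report = run_experiment(toy_edges, "a", "c")
print(report["nodes"], "nodes,", report["edges"], "edges")
print("build:", report["build_seconds"], "s; search:", report["search_seconds"], "s")
```

Since the source and destination are passed in as arguments, the time spent selecting them stays outside both timers.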
To make range searching efficient you will implement an index structure. Since the roads are (kind of) uniformly distributed, I suggest you use a fixed-grid structure. However, if you want to use a (bucket) quadtree or kd-tree, feel free! You will, of course, get public recognition if you choose to do so. Some things you need to think about:
To get this to work, remember, think before you start (programming)!! If you spend even just a few hours thinking, it can save you hours of frustration! Have the entire layout of your program clear in your mind, and program it incrementally. Use short functions, and suggestive variable names.
Start by implementing range searching as a straightforward linear scan. Then refine it by building the grid. And so on.
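As a rough illustration of this incremental approach, here is a minimal Python sketch with both versions side by side. It assumes the data is a list of (x, y) points and that a query is an axis-aligned rectangle with inclusive bounds; your project may index something else (for example road segments), and the function names and the cell size are hypothetical. The linear scan is kept around as a reference answer for checking the grid.

```python
def linear_range_query(points, xmin, ymin, xmax, ymax):
    """Baseline: test every point against the query rectangle."""
    return [(x, y) for x, y in points
            if xmin <= x <= xmax and ymin <= y <= ymax]


def cell_of(x, y, cell_size):
    """Grid cell (col, row) containing the point; also works for negative coordinates."""
    return (int(x // cell_size), int(y // cell_size))


def build_grid(points, cell_size):
    """Fixed grid stored as {(col, row): [points in that cell]}."""
    grid = {}
    for x, y in points:
        grid.setdefault(cell_of(x, y, cell_size), []).append((x, y))
    return grid


def grid_range_query(grid, cell_size, xmin, ymin, xmax, ymax):
    """Visit only the cells that overlap the rectangle, then filter their points."""
    col_lo, row_lo = cell_of(xmin, ymin, cell_size)
    col_hi, row_hi = cell_of(xmax, ymax, cell_size)
    found = []
    for col in range(col_lo, col_hi + 1):
        for row in range(row_lo, row_hi + 1):
            for x, y in grid.get((col, row), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    found.append((x, y))
    return found


# Made-up points; the grid answer is checked against the linear scan.
points = [(0.5, 0.5), (1.5, 1.5), (2.5, 0.5), (-0.5, -0.5)]
grid = build_grid(points, 1.0)
query = (0.0, 0.0, 2.0, 2.0)
assert sorted(grid_range_query(grid, 1.0, *query)) == sorted(linear_range_query(points, *query))
```

The grid is stored as a dictionary here so that empty cells take no space; a two-dimensional array of buckets would work just as well when the bounding box of the data is known in advance.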
As you program, remember to follow the programming style we discussed (no globals, no function longer than a screen, etc.). I will not help you debug unless you follow these guidelines!