The goal of this project is to compute the total viewshed of a grid terrain, parallelize the computation using OpenMP, and assess the effect of this parallelization using the multicore servers on Bowdoin's HPC grid. You will also explore how a row-major grid layout versus a blocked grid layout affects overall performance. You will describe your results in a paper.
[ltoma@dover:~] ./totalview test2.asc test2totvis.asc

This will read elevation grid test2.asc, compute the total viewshed and save it as grid test2totvis.asc. Note: the grids are assumed to be in the ASCII format.
The output grid, test2totvis.asc, represents the total viewshed grid of test2.asc.
Use the viewshed function from the previous project, and modify it so that instead of returning a viewshed grid, it simply returns a count of how many grid points are visible (i.e. equal to 1). For example, it could look like this:
/* Compute the viewshed of (vprow, vpcol) and store it in the viewshed
   grid vg, which is assumed to be allocated prior to this call. After
   computing the viewshed, count its size and return it. */
int compute_viewshed_size(Grid eg, Grid vg, int vprow, int vpcol);

Note that the viewshed grid is allocated outside, to save time (see below). Then you'll have a parallel loop that calls this function for all (i,j). It might look something like this:
/* Compute the total viewshed of elevation grid eg, and store it in the
   grid tvg, which is assumed to be allocated prior to this call. */
void compute_total_viewshed(Grid eg, Grid tvg) {
  Grid tmpgrid;  // we'll use this to store each individual viewshed
  // allocate tmpgrid here
  int i, j;
  for (i = 0; i < eg.nrows; i++) {
    for (j = 0; j < eg.ncols; j++) {
      // compute_viewshed_size resets and refills tmpgrid on each call
      set(tvg, i, j, compute_viewshed_size(eg, tmpgrid, i, j));
    } // for j
  } // for i
}
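Note that the sketch above allocates a single tmpgrid, which is fine serially but becomes a hazard once the loop is parallelized: all threads would overwrite the same temporary viewshed. Below is a minimal OpenMP sketch that gives each thread its own temporary grid. The helpers grid_init and grid_free are hypothetical names for your own allocate/free routines; adapt them to your Grid code.

#include <omp.h>

/* Sketch: parallel total viewshed. Each thread owns a private temporary
   grid, so concurrent calls to compute_viewshed_size do not interfere. */
void compute_total_viewshed(Grid eg, Grid tvg) {
  #pragma omp parallel
  {
    /* hypothetical helper: allocate a grid with eg's dimensions */
    Grid tmpgrid = grid_init(eg.nrows, eg.ncols);

    /* dynamic schedule: viewshed cost varies from viewpoint to viewpoint */
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < eg.nrows; i++)
      for (int j = 0; j < eg.ncols; j++)
        set(tvg, i, j, compute_viewshed_size(eg, tmpgrid, i, j));

    grid_free(tmpgrid);  /* hypothetical helper */
  }
}

Writes to tvg are race-free because each (i,j) is written by exactly one iteration; the temporary grid is the only shared state to worry about, hence the per-thread copy.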
./totviewshed set1.asc set1totview.asc 1
computing total viewshed on set1.asc..
TOTAL xxxx seconds
Testing for race conditions: try running with different numbers of cores. Render the output, and compare it to the output obtained with one core. If they differ, that's bad: you have a race condition. Go back to debugging.
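If you prefer a programmatic check over eyeballing the rendered outputs, a tiny helper (a sketch, using the grid getter described below) can compare two output grids cell by cell:

/* return 1 if grids a and b have the same dimensions and values, 0 otherwise */
int grids_equal(Grid a, Grid b) {
  if (a.nrows != b.nrows || a.ncols != b.ncols) return 0;
  for (int i = 0; i < a.nrows; i++)
    for (int j = 0; j < a.ncols; j++)
      if (get(a, i, j) != get(b, i, j)) return 0;
  return 1;
}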
Here is how I would do it: the functions that compute the viewshed and the total viewshed should use the grid getter and setter to access the grid. When they want to read or write an element of a grid at row i and column j, they should use
/* return the element in g at row i and column j */
float get(Grid g, int i, int j);

/* set the element in g at row i and column j to value x */
void set(Grid g, int i, int j, float x);

Anyone who uses a grid should make no assumptions about where the element at row i and column j is stored; that is the internal business of the grid, and it's encapsulated in the grid.
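For concreteness, if Grid stores its elements in a flat array, the row-major getter and setter are one-liners. The struct below is illustrative (the field names are assumptions, not a required interface):

typedef struct {
  int nrows, ncols;
  float *data;  /* nrows*ncols elements; the layout is hidden behind get/set */
} Grid;

/* row-major: element (i,j) lives at offset i*ncols + j */
float get_rowmajor(Grid g, int i, int j) {
  return g.data[(long)i * g.ncols + j];
}

void set_rowmajor(Grid g, int i, int j, float x) {
  g.data[(long)i * g.ncols + j] = x;
}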
At the top of your grid code define a flag that says whether to use row-major or blocked layout. The user chooses between row-major and blocked order by defining one or the other, and recompiling.
#define ROWMAJOR
//#define BLOCKED

Then I would define:
/* return the element in g at row i and column j */
float get(Grid g, int i, int j) {
#ifdef ROWMAJOR
  return get_rowmajor(g, i, j);
#else
  return get_blocked(g, i, j);
#endif
}
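For the blocked layout, one common scheme (a sketch; it assumes the data array is padded out to whole B x B tiles) stores the grid as a row-major sequence of B x B tiles, each tile itself stored row-major:

#define B 64  /* block side length; a tuning parameter to experiment with */

/* blocked: element (i,j) lives in tile (i/B, j/B), at (i%B, j%B) inside it */
float get_blocked(Grid g, int i, int j) {
  long tiles_per_row = (g.ncols + B - 1) / B;     /* tiles in one tile-row */
  long tile = (i / B) * tiles_per_row + (j / B);  /* which tile */
  long offset = (long)(i % B) * B + (j % B);      /* position inside the tile */
  return g.data[tile * B * B + offset];
}

set_blocked mirrors this computation. Note that with this scheme the data array must be allocated with room for the padded tiles, i.e. ceil(nrows/B) * ceil(ncols/B) * B * B floats.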
The servers in the grid are not interactive machines: you cannot interact with them the way you interact with dover and foxcroft, or with your laptop. The Grid is set up to run batch jobs only (not interactive and/or GUI applications).
The servers on the Grid run the Sun Grid Engine (SGE), which is a software environment that coordinates the resources on the grid. The grid has a headnode which accepts jobs, puts them in a waiting queue until they can be run, sends them to the computational node(s) on the grid to run them, manages them while they run, and notifies the owner when the job is finished. This headnode is a machine called moosehead. To interact with The Grid you need to login to the Grid headnode "moosehead.bowdoin.edu" via an SSH client program.
ssh moosehead.bowdoin.edu
Moosehead is an old server which was configured to run the Sun Grid Engine and do whatever a headnode is supposed to do: moosehead accepts jobs, puts them in a queue until they can be executed, sends them to an execution machine, manages them during execution, and logs the record of their execution when they are finished.
Moosehead runs Linux, so in principle you can run on it anything that you could run on dover. However, DJ (the sysadmin, and Director of Bowdoin's HPC Grid) asks that you don't. Moosehead is an old machine. Use it only to submit jobs to the grid and to interact with the grid. Do your compiling, developing and testing somewhere else (e.g. on dover).
The Grid uses the same shared filespace as all of the Bowdoin Linux machines, so you can access the same home directory and data space as with dover or foxcroft (if you need to transfer files from a machine that is not a part of the Bowdoin network, use scp from your machine to dover or foxcroft first).
To submit to the grid you have two options:
ssh moosehead
cd [directory-where-your-code-is-compiled]
hpcsub -pe smp 8 -cmd [your-code] [arguments to pass to the program]

The arguments -pe smp 8 are optional (but, if you are running OpenMP code, you should use them). They specify that your code is to be run in the SMP environment, with 8 cores (here 8 is only an example; it can be any number you want).
For example, if I want to run hellosmp that we talked about in class (which you can find here) using 8 CPU cores in the SMP environment, I would do:
ssh moosehead
[ltoma@moosehead:~]$ pwd
/home/ltoma
[ltoma@moosehead:~]$ cd public_html/teaching/cs3225-GIS/fall17/Code/OpenMP/
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ ls
example1.c  example2.cpp  example3.c  example4.c  hellosmp  hellosmp.c  hellosmp.h  hellosmp.o  Makefile
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ hpcsub -pe smp 8 -cmd hellosmp
Submitting job using: qsub -pe smp 8 hpc.10866
Your job 236150 ("hpc.10866") has been submitted
The headnode puts this job in the queue and starts looking for 8 free cores. When 8 cores become available, it assigns them to your job. While your job is running, no other job can use the 8 cores it was assigned: they are exclusively yours while your job runs. To check the jobs currently in the queue, do:
qstat

To check on all jobs running on the cluster, type

qstat -u "*"

For a full listing of all jobs on the cluster, type

qstat -f -u "*"

To display the list of all jobs belonging to user foo, type

qstat -u foo

After I submit a job I usually check the queue:
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
job-ID  prior    name       user   state  submit/start at      queue          slots  ja-task-ID
------------------------------------------------------------------------------------------------
 236150 0.00000  hpc.10866  ltoma  qw     10/12/2016 15:53:20                 8

[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
job-ID  prior    name       user   state  submit/start at      queue          slots  ja-task-ID
------------------------------------------------------------------------------------------------
 236150 0.58278  hpc.10866  ltoma  r      10/12/2016 15:53:27  all.q@moose15  8

Note how the job initially shows as "qw" (queued and waiting) and then changes to "r" (running).
When the job is done you will get an email. If you list the files, you will notice a new file called "hpc.[script-id].o[job-number]" (here, hpc.10866.o236150). This file holds the standard output of your job: all the print commands are redirected to this file.
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ ls
example1.c  example2.cpp  example3.c  example4.c  hellosmp  hellosmp.c  hellosmp.h  hellosmp.o  hpc.10866.o236150  Makefile
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ cat hpc.10866.o236150
I am thread 1. Hello world!
I am thread 2. Hello world!
I am thread 7. Hello world!
I am thread 0. Hello world!
I am thread 5. Hello world!
I am thread 6. Hello world!
I am thread 4. Hello world!
I am thread 3. Hello world!
The second option is to write a submission script, say myscript.sh, that contains the SGE options and the command to run:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -M (my_login_name)@bowdoin.edu -m b -m e

./hellosmp

To submit your job to the grid you will do:
ssh moosehead
cd [folder-containing-myscript.sh]
qsub myscript.sh

Example:
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ cat myscript.sh
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -M ltoma@bowdoin.edu -m b -m e

#./hellosmp
./example1
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qsub myscript.sh
Your job 236154 ("myscript.sh") has been submitted
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
job-ID  prior    name        user   state  submit/start at      queue          slots  ja-task-ID
------------------------------------------------------------------------------------------------
 236154 0.00000  myscript.s  ltoma  qw     10/12/2016 16:00:17                 1

[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
job-ID  prior    name        user   state  submit/start at      queue          slots  ja-task-ID
------------------------------------------------------------------------------------------------
 236154 0.50500  myscript.s  ltoma  r      10/12/2016 16:00:27  all.q@moose22  1

[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$

Note how your job went from "qw" to not in the queue (basically it ran and finished so fast that we could not see it).
Each job creates a file named by appending the job number to the script name; in our case this is a file called "myscript.sh.o[job-number]". These .o* files are the equivalent of what you would see on the console if you ran the program interactively.
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ ls
example1    example1.c  example2.cpp  example3.c  example4.c  hello  hellosmp  hellosmp.c  hellosmp.h  hellosmp.o  hpc.10866.o236150  Makefile  myscript.sh  myscript.sh.o236154
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ cat myscript.sh.o236154
Hello World from thread 0
There are 1 threads
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$
Looking at the output we see that example1 was run with just one thread. That's because when we submitted we did not specify that we wanted the SMP environment and how many threads we wanted, so we got the default (a single slot, hence one thread). When running OpenMP code you need to submit using the arguments -pe smp [number-of-threads]. For example:
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qsub -pe smp 8 myscript.sh
Your job 236155 ("myscript.sh") has been submitted
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ cat myscript.sh.o236155
Hello World from thread 5
Hello World from thread 1
Hello World from thread 0
There are 8 threads
Hello World from thread 6
Hello World from thread 7
Hello World from thread 2
Hello World from thread 4
Hello World from thread 3

Ah, that's better.
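For reference, a program consistent with example1's output would look roughly like this (a sketch; the actual course file may differ). Compile with gcc -fopenmp:

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    printf("Hello World from thread %d\n", omp_get_thread_num());
    /* let one thread report the team size */
    #pragma omp single
    printf("There are %d threads\n", omp_get_num_threads());
  }
  return 0;
}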
If you are running an experimental analysis and you care about the timings, you want to request the whole machine for yourself, even if your job is only going to use x processors. You can do that by including the flag -l excl=true:
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qsub -l excl=true -pe smp 8 myscript.sh
Your job 236157 ("myscript.sh") has been submitted
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
job-ID  prior    name        user   state  submit/start at      queue          slots  ja-task-ID
------------------------------------------------------------------------------------------------
 236157 0.00000  myscript.s  ltoma  qw     10/12/2016 16:05:15                 8

[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
job-ID  prior    name        user   state  submit/start at      queue          slots  ja-task-ID
------------------------------------------------------------------------------------------------
 236157 0.60500  myscript.s  ltoma  r      10/12/2016 16:05:27  all.q@moose22  8

[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$ qstat
[ltoma@moosehead:~/public_html/teaching/cs3225-GIS/fall17/Code/OpenMP]$
(Briefly) describe the total viewshed problem, what the project is doing, and why (to support the why, bring in the running time of the total viewshed on one processor).
Here you'll want to say that you use OpenMP, which conveniently provides a parallel for loop; the important part is to describe the details of your parallel for loop.
Describe the experiments you ran to assess the effect of parallelization: include the table with the running times and the plot of the speedup.
Datasets: Use set1.asc. It would be great if you also ran experiments for kaweah.asc, but since the running times are larger, it's optional.
For the experiments, include some brief detail on the command you used to submit the jobs that can help us interpret and compare the running times with those of your peers, such as whether you used the -l excl=true flag. Also include info on which server ran your job.
The table: the running time of your code on the grid with number of cores P = 1, 2, 4, 8, 12, 16, 20, 24, 32, 40, and the speedup obtained in each case (speedup is defined as T1/Tk, where T1 is the time to run with P = 1 core and Tk is the time to run with P = k cores).
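As a sanity check with made-up numbers: if the run with P = 1 takes T1 = 400 seconds and the run with P = 8 takes T8 = 60 seconds, then the speedup at 8 cores is 400/60 ≈ 6.7, somewhat below the ideal speedup of 8.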
The plot: the speedup as a function of the number of cores, for set1.asc.
Also include a screenshot of the total viewshed computed by your code on set1.asc (use render2d to render it).
Discuss your findings.
Describe the experiments you have done to assess the effect of the blocked layout, and discuss your findings. Describe how you chose the block size.
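If you need a starting point for the block size, here is some illustrative arithmetic (the cache numbers are assumptions about the hardware, not measurements): with 4-byte floats, a 64 x 64 block occupies 64 * 64 * 4 = 16 KB, so a couple of such blocks fit in a typical 32 KB L1 data cache, while a 256 x 256 block (256 KB) would not. Timing a few powers of two around that range is a reasonable way to justify your choice in the paper.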