OpenMP program structure: An OpenMP program has sections that are sequential and sections that are parallel. In general an OpenMP program starts with a sequential section in which it sets up the environment, initializes the variables, and so on.
When run, an OpenMP program will use one thread (in the sequential sections) and several threads (in the parallel sections).
There is one thread that runs from the beginning to the end, and it's called the master thread. The parallel sections of the program will cause additional threads to fork. These are called the slave threads.
A section of code that is to be executed in parallel is marked by a special directive (an omp pragma). When execution reaches a parallel section, this directive causes slave threads to fork. Each thread executes the parallel section of the code independently. When a thread finishes, it joins the master. When all threads finish, the master continues with the code following the parallel section.
Each thread has an ID attached to it that can be obtained using a runtime library function (called omp_get_thread_num()). The ID of the master thread is 0.
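To make this fork-join structure concrete, here is a minimal sketch (the messages printed are my own; the functions used are standard OpenMP):

#include <stdio.h>
#include <omp.h>

int main(void) {
  //sequential section: only the master thread runs here
  printf("setting up, on one thread\n");

  //parallel section: the master forks a team of slave threads
  #pragma omp parallel
  {
    //every thread, master included, executes this block
    printf("hello from thread %d\n", omp_get_thread_num());
  } //threads join here; only the master continues

  //sequential section again: just the master, whose ID is 0
  printf("done, back to one thread\n");
  return 0;
}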
Why OpenMP? Writing lower-level, potentially more efficient parallel code by hand is possible; however, OpenMP hides the low-level details and allows the programmer to describe the parallel code with high-level constructs, which is about as simple as it gets.
OpenMP has directives that allow the programmer to:
- mark sections of code to be executed in parallel by a team of threads
- declare variables as shared or private
- divide the work of a loop among the threads
- synchronize threads (barriers, critical sections, atomic updates)
To compile with gcc, use the -fopenmp flag:

gcc -fopenmp hellosmp.c -o hellosmp
It's also pretty easy to get OpenMP to work on a Mac. A quick Google search reveals that the native Apple compiler, clang, is installed without OpenMP support. When you installed gcc, it probably also got installed without OpenMP support. To test, go to the terminal and try to compile something:
gcc -fopenmp hellosmp.c -o hellosmp

If you get an error message saying that "omp.h" is unknown, that means your compiler does not have OpenMP support:
hellosmp.c:12:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^
1 error generated.
make: *** [hellosmp.o] Error 1

Here's what I did:
1. I installed Homebrew, the missing package manager for MacOS, http://brew.sh/index.html
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

2. Then I asked brew to install gcc:
brew install gcc

3. Then type 'gcc' and press tab; it will complete with all the versions of gcc installed:
$ gcc
gcc           gcc-6         gcc-ar-6      gcc-nm-6      gcc-ranlib-6  gccmakedep

4. The obvious guess here is that gcc-6 is the latest version, so I use it to compile:
gcc-6 -fopenmp hellosmp.c

Works!
#pragma omp parallel
{
  ...
}

When the master thread reaches this line, it forks additional threads to carry out the work enclosed in the block following the #pragma construct. The block is executed by all threads in parallel. The original thread will be denoted as the master thread, with thread ID 0.
Example (C program): Display "Hello, world." using multiple threads.
#include <stdio.h>

int main(void) {
  #pragma omp parallel
  {
    printf("Hello, world.\n");
  }
  return 0;
}

Use flag -fopenmp to compile using gcc:
$ gcc -fopenmp hello.c -o hello

Output on a computer with two cores, and thus two threads:
Hello, world.
Hello, world.

On dover, I got 24 hellos, for 24 threads. On my desktop I get (only) 8. How many do you get?
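The runtime typically defaults to one thread per core; a standard way to change the team size (a feature of the OpenMP runtime, not of this particular program) is the OMP_NUM_THREADS environment variable:

$ OMP_NUM_THREADS=4 ./hello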
Note that the threads are all writing to the standard output, and there is a race to share it. The way the threads are interleaved is completely arbitrary, and you can get garbled output:
Hello, wHello, woorld.
rld.
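If you want each line to come out whole, one option is to serialize the printing with the critical directive, which appears again later in these notes; a sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    //only one thread at a time may execute a critical section,
    //so the output lines can no longer interleave
    #pragma omp critical
    printf("Hello, world.\n");
  }
  return 0;
}

Of course, this serializes the printing itself, so it should be used sparingly.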
Whether a variable is private or shared is specified in a clause following the #pragma omp directive:
Example:
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
  int th_id, nthreads;
  //th_id is declared above. It is specified as private, so each
  //thread will have its own copy of th_id
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);
  }
  return 0;
}

Private or shared? Sometimes your algorithm will require sharing variables, other times it will require private variables. The caveat with sharing is race conditions. The task of thinking through the details of a parallel algorithm and specifying the type of the variables falls, of course, to the programmer.
Barrier example:
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
  int th_id, nthreads;
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);
    //barrier: every thread waits here until all threads have
    //finished the work above; only then does the master print
    #pragma omp barrier
    if (th_id == 0) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return 0;
} //main

Note above the function omp_get_num_threads(). Can you guess what it's doing? Some other runtime functions include omp_set_num_threads(), omp_get_max_threads(), omp_get_num_procs(), and omp_get_wtime().
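A quick sketch exercising these runtime functions (the calls are standard OpenMP; the numbers printed will depend on your machine):

#include <stdio.h>
#include <omp.h>

int main(void) {
  printf("processors available: %d\n", omp_get_num_procs());
  printf("max threads: %d\n", omp_get_max_threads());

  omp_set_num_threads(4); //request 4 threads for subsequent parallel regions

  double start = omp_get_wtime(); //wall-clock timer
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      printf("team size: %d\n", omp_get_num_threads());
  }
  printf("elapsed: %f seconds\n", omp_get_wtime() - start);
  return 0;
}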
The #pragma omp for directive is called a work-sharing construct, and must be placed inside a parallel section:
#pragma omp for //specify a for loop to be parallelized; no curly braces

The "#pragma omp for" directive distributes the loop among the threads. It must be used inside a parallel block:
#pragma omp parallel
{
  ...
  #pragma omp for
  //for loop to parallelize
  ...
} //end of parallel block

Example:
//compute the sum of two arrays in parallel
#include <stdio.h>
#include <omp.h>
#define N 1000000

int main(void) {
  float a[N], b[N], c[N];
  int i;

  /* Initialize arrays a and b */
  for (i = 0; i < N; i++) {
    a[i] = i * 2.0;
    b[i] = i * 3.0;
  }

  /* Compute values of array c = a+b in parallel. */
  #pragma omp parallel shared(a, b, c) private(i)
  {
    #pragma omp for
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      printf("%f\n", c[i]);
    }
  }
}

Another example: adding all elements in an array.
//example4.c: add all elements in an array in parallel
#include <stdio.h>

int main() {
  const int N = 100;
  int a[N];

  //initialize
  for (int i = 0; i < N; i++)
    a[i] = i;

  //compute sum
  int local_sum, sum = 0;
  #pragma omp parallel private(local_sum) shared(sum)
  {
    local_sum = 0;
    //the array is distributed statically between threads
    #pragma omp for schedule(static, 1)
    for (int i = 0; i < N; i++) {
      local_sum += a[i];
    }
    //each thread calculated its local_sum. All threads have to add to
    //the global sum. It is critical that this operation is atomic.
    #pragma omp critical
    sum += local_sum;
  }
  printf("sum=%d should be %d\n", sum, N * (N - 1) / 2);
}

There also exists a "parallel for" directive, which combines a parallel and a for (no need to nest a for inside a parallel):
#include <stdio.h>

int main(int argc, char **argv) {
  int a[100000];
  #pragma omp parallel for
  for (int i = 0; i < 100000; i++) {
    a[i] = 2 * i;
    printf("assigning i=%d\n", i);
  }
  return 0;
}

Exactly how the iterations are assigned to each thread is specified by the schedule (see below). Note: since variable i is declared inside the parallel for, each thread will have its own private version of i.
For example, the following schedules the loop statically, in chunks of 5 iterations per thread:

#pragma omp for schedule(static, 5)
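To see the effect of the chunk size, a small sketch (my own example) prints which thread runs which iteration; with schedule(static, 5), iterations are handed to the threads round-robin in chunks of 5 (thread 0 gets 0-4, thread 1 gets 5-9, and so on):

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    //chunks of 5 consecutive iterations are assigned round-robin
    #pragma omp for schedule(static, 5)
    for (int i = 0; i < 20; i++) {
      printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }
  }
  return 0;
}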
The atomic directive ensures that a given memory update happens as one indivisible operation:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int count = 0;
  #pragma omp parallel shared(count)
  {
    #pragma omp atomic
    count++; // count is updated by only a single thread at a time
  }
  printf("Number of threads: %d\n", count);
}
There are restrictions on which loops can be parallelized. The for loop cannot exit early; for example:
// BAD - can't parallelize with OpenMP
for (int i = 0; i < 100; i++) {
  if (i > 50) break; //breaking out of the loop when i is greater than 50
}

Values of the loop control expressions must be the same for all iterations of the loop. For example:
// BAD - can't parallelize with OpenMP
for (int i = 0; i < 100; i++) {
  if (i == 50) i = 0; //modifying the loop variable inside the loop body
}
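For contrast, here is a loop in the canonical form that OpenMP can parallelize (a minimal sketch): fixed bounds, the loop variable modified only in the increment, and no early exit:

#include <stdio.h>

int main(void) {
  int a[100];
  // GOOD - fixed bounds, no break, i changes only in the increment
  #pragma omp parallel for
  for (int i = 0; i < 100; i++) {
    a[i] = i * i;
  }
  printf("a[99]=%d\n", a[99]);
  return 0;
}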