With OpenMP, the sections of code to be executed in parallel are marked with a special directive. When execution reaches a marked section, threads are created (forked). The main thread is the master thread. The slave threads all run in parallel and run the same code: each thread executes the parallelized section of the code independently. When a thread finishes, it joins the master. When all threads have finished, the master continues with the code following the parallel section.
Each thread has an ID attached to it that can be obtained using a runtime library function (called omp_get_thread_num()). The ID of the master thread is 0.
OpenMP supports C, C++ and Fortran.
The OpenMP functions are declared in a header file called omp.h. The OpenMP parts of the code are specified using #pragma directives.
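OpenMP has directives that allow the programmer to specify the parallel region, specify whether the variables in the parallel section are private or shared, specify how (or whether) the threads are synchronized, specify how to parallelize loops, and specify how the work is divided between threads (scheduling).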
It’s also pretty easy to get OpenMP to work on a Mac. A quick
search with Google reveals that Apple's native compiler, clang, is
installed without OpenMP support. If you installed gcc, it probably
also got installed without OpenMP support. To test, go to the terminal and
try to compile something:
Example (C program): Display "Hello, world." using multiple threads.
Note that the threads are all writing to the standard output, and
there is a race to share it. The way the threads are interleaved is
completely arbitrary, and you can get garbled output:
Barrier example:
The #pragma omp for directive is called a work-sharing construct:
The for loop cannot exit early, for example:
OpenMP hides the low-level details and allows the programmer to
describe the parallel code with high-level constructs, which is as
simple as it can get.
Compiling and running OpenMP code
The public Linux machines dover and foxcroft have gcc/g++ installed
with OpenMP support. All you need to do is use the -fopenmp flag on
the command line:
gcc -fopenmp hellosmp.c -o hellosmp
If you get an error message saying that “omp.h” is unknown, that means your compiler does not have OpenMP support.
hellosmp.c:12:10: fatal error: 'omp.h' file not found
Here’s what I did:
1. I installed Homebrew, the missing package manager for macOS:
/usr/bin/ruby -e "$(curl -fsSL"
2. Then I asked brew to install gcc:
brew install gcc
3. Then type ‘gcc’ and press tab; it will complete with all the versions of gcc installed:
gcc gcc-6 gcc-ar-6 gcc-nm-6 gcc-ranlib-6 gccmakedep
4. The obvious guess here is that gcc-6 is the latest version, so I use it to compile:
gcc-6 -fopenmp hellosmp.c
Specifying the parallel region (creating threads)
The basic directive is:
#pragma omp parallel
This is used to fork additional threads to carry out the work enclosed
in the block following the #pragma construct. The block is executed
by all threads in parallel. The original thread is denoted the
master thread, with thread ID 0.
#include <stdio.h>
int main(void)
{
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
Use flag -fopenmp to compile using gcc:
$ gcc -fopenmp hello.c -o hello
Output on a computer with two cores, and thus two threads:
Hello, world.
Hello, world.
On dover, I got 24 hellos, for 24 threads. On my desktop I get (only) 8. How many do you get?
Hello, wHello, woorld.
Private and shared variables
In a parallel section, variables can be private (each thread owns a
copy of the variable) or shared among all threads. Shared variables
must be used with care because they can cause race conditions.
Whether a variable is private or shared is specified with clauses following #pragma omp parallel, for example private(th_id) or shared(sum):
#include <stdio.h>
#include <omp.h>
int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        // th_id is declared above. It is specified as private, so each thread has its own copy of th_id
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
    }
    return 0;
}
Sharing variables is sometimes what you want, other times it's not, and
can lead to race conditions. Put differently, some variables need to
be shared, some need to be private, and you the programmer have to
specify what you want.
OpenMP lets you specify how to synchronize the threads, for example with barriers, critical sections and atomic updates.
More on barriers: If we wanted all threads to be at a specific point
in their execution before proceeding, we would use a barrier. A
barrier basically tells each thread, "wait here until all other
threads have reached this point...".
#include <stdio.h>
#include <omp.h>
int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier   // every thread waits here until all threads have printed
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return 0;
}
Note above the function omp_get_num_threads(). Can you guess what it’s doing?
Some other runtime functions are:
Parallelizing loops
Parallelizing loops with OpenMP is straightforward. One simply denotes
the loop to be parallelized and a few parameters, and OpenMP takes
care of the rest. Can't be easier!
#pragma omp for
//specify a for loop to be parallelized; no curly braces
The “#pragma omp for” distributes the loop among the threads. It must
be used inside a parallel block:
#pragma omp parallel
{
#pragma omp for
//for loop to parallelize
}//end of parallel block
//compute the sum of two arrays in parallel
#include <stdio.h>
#include <omp.h>
#define N 1000000

/* static: three arrays of 1,000,000 floats (about 12 MB) can overflow the stack */
static float a[N], b[N], c[N];

int main(void) {
    int i;
    /* Initialize arrays a and b */
    for (i = 0; i < N; i++) {
        a[i] = i * 2.0;
        b[i] = i * 3.0;
    }
    /* Compute values of array c = a+b in parallel. */
    #pragma omp parallel shared(a, b, c) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }
    printf("%f\n", c[10]);
    return 0;
}
Another example: adding all elements in an array.
//example4.c: add all elements in an array in parallel
#include <stdio.h>
int main() {
    const int N = 100;
    int a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;
    //compute sum; sum must start at 0 before the threads add to it
    int local_sum, sum = 0;
    #pragma omp parallel private(local_sum) shared(sum)
    {
        local_sum = 0;
        //the array is distributed statically between threads
        #pragma omp for schedule(static,1)
        for (int i = 0; i < N; i++) {
            local_sum += a[i];
        }
        //each thread has calculated its local_sum. All threads have to add to
        //the global sum. It is critical that this operation is atomic.
        #pragma omp critical
        sum += local_sum;
    }
    printf("sum=%d should be %d\n", sum, N*(N-1)/2);
    return 0;
}
There is also a “parallel for” directive, which combines parallel
and for (no need to nest a for inside a parallel):
#include <stdio.h>
int main(int argc, char **argv)
{
    int a[100000];
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        a[i] = 2 * i;
        printf("assigning i=%d\n", i);
    }
    return 0;
}
Exactly how the iterations are assigned to each thread is
specified by the schedule (see below).
Note: since the variable i is declared inside the parallel for, each thread
has its own private copy of i.
Loop scheduling
OpenMP lets you control how loop iterations are scheduled among the threads.
The schedule types available are static, dynamic, guided, auto and runtime.
This is specified by appending schedule(type, chunk) after the pragma for directive:
#pragma omp for schedule(static, 5)
More complex directives
...which you probably won't need.
#include <stdio.h>
#include <omp.h>
int main(void) {
    int count = 0;
    #pragma omp parallel shared(count)
    {
        #pragma omp atomic
        count++; // count is updated by only a single thread at a time
    }
    printf("Number of threads: %d\n", count);
    return 0;
}
Performance considerations
Critical sections and atomic sections serialize the execution and
eliminate the concurrent execution of threads. If used unwisely,
OpenMP code can be worse than serial code because of all the thread
Some comments
OpenMP is not magic. A loop must be obviously parallelizable in order
for OpenMP to divide its iterations among the threads. If there are any
data dependencies from one iteration to the next, OpenMP can't
parallelize it correctly. The compiler does not check for such
dependencies, so it will still produce a parallel loop, which then gives wrong results.
// BAD - can't parallelize with OpenMP
for (int i = 0; i < 100; i++) {
    if (i > 50)
        break; // breaking when i is greater than 50
}
Values of the loop control expressions must be the same for all iterations of the loop. For example:
// BAD - can't parallelize with OpenMP
for (int i = 0; i < 100; i++) {
    if (i == 50)
        i = 0; // the loop variable is modified inside the loop
}