----赖格英-----

记忆不好了,记录工作中的点点滴滴....

  博客园  :: 首页  :: 新随笔  :: 联系 :: 订阅 订阅  :: 管理

Beginning OpenMP

OpenMP provides a straight-forward interface to write software that can use multiple cores of a computer. Using OpenMP you can write code that uses all of the cores in a multicore computer, and that will run faster as more cores become available.

OpenMP is a well-established, standard method of writing parallel programs. It was first released in 1997, and is currently on version 3.0. It is provided by default in nearly all compilers, e.g. the gnu compiler suite (gcc, g++, gfortran), the Intel compilers (icc, icpc, ifort) and the Portland Group compilers (pgc, pgCC, pgf77) and works on nearly all operating systems (e.g. Linux, Windows and OS X).

You can read about the history of OpenMP at its Wikipedia page, or by going to one of the many OpenMP websites. The best book to learn OpenMP in my opinion is Using OpenMP: Portable Shared Memory Parallel Programming (I used it to learn OpenMP).

OpenMP can be combined with other parallel programming technologies, e.g. MPI. This course is presented as a companion to my MPI course, with both this and the MPI course following a similar structure and presenting similar examples. If you want to compare OpenMP and MPI, then please click on the Compare with MPI links on each page.

This is a short course that will provide a taster of OpenMP. The course is arranged with examples in C, C++ and Fortran. Feel free to look at just the language you are familiar with, or, if you are interested, feel free to compare OpenMP use in each of the languages. Please also feel free to work through this course at your own pace. OpenMP is best learned by using it, so please copy out and play with the examples provided, and also have a go at the exercises.

Hello OpenMP!

The first stage is to write a small OpenMP program. Choose from C, C++ or Fortran to write the program hello_openmp. Once you have written and compiled hello_openmp return to this page.

Now that you have the compiled the hello_openmp executable, the next step is to run it. The way to run the executable is the same regardless of the programming language used to produce it. The first step is to set the number of threads of execution that you want to run in parallel. This is achieved by setting the environmental variable OMP_NUM_THREADS to the desired number of threads, e.g. in the bash shell;

export OMP_NUM_THREADS=4

or in the tcsh shell;

setenv OMP_NUM_THREADS 4

Now that you have set the number of threads, run the executable by typing;

./hello_openmp

You should see the following output.

Hello OpenMP!
Hello OpenMP!
Hello OpenMP!
Hello OpenMP!

The line Hello OpenMP! is output four times, as the program split into four threads, and each thread printed Hello OpenMP!. Play with the number of threads (by changing OMP_NUM_THREADS) and see how that affects the output.


Next - Compiler Directives / Pragmas



C

Most C compilers support the use of OpenMP. Available compilers include gcc (version 4.2 or above), icc and pgc.

The first step is to create a simple OpenMP C program, which we will call hello_openmp. Open a text editor (e.g. pico), create a file called hello_openmp.c and copy in the following code;

#include <stdio.h>

int main(int argc, char **argv)
{
    #pragma omp parallel
    {
        printf("Hello OpenMP!\n");
    }

    return 0;
}

The only new line in this example is #pragma omp parallel, which is used to specify that all of the code between the curly brackets is part of an OpenMP parallel section.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gcc : gcc -fopenmp hello_openmp.c -o hello_openmp
  • icc : icc -openmp hello_openmp.c -o hello_openmp -cxxlib-icc
  • pgc : pgc -mp hello_openmp.c -o hello_openmp

This will produce the executable, hello_openmp.

Now return to the previous page to see how to run hello_openmp in parallel.

C++

Most C++ compilers support the use of OpenMP. Available compilers include g++ (version 4.2 or above), icpc and pgCC.

The first step is to create a simple OpenMP C++ program, which we will call hello_openmp. Open a text editor (e.g. pico), create a file called hello_openmp.cpp and copy in the following code;

#include <iostream>

int main(int argc, const char **argv)
{
    #pragma omp parallel
    {
        std::cout << "Hello OpenMP!\n";
    }

    return 0;
}

The only new line in this example is #pragma omp parallel, which is used to specify that all of the code between the curly brackets is part of an OpenMP parallel section.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • g++ : g++ -fopenmp hello_openmp.cpp -o hello_openmp
  • icpc : icpc -openmp hello_openmp.cpp -o hello_openmp -cxxlib-icc
  • pgCC : pgCC -mp hello_openmp.cpp -o hello_openmp

This will produce the executable, hello_openmp.

Now return to the previous page to see how to run hello_openmp in parallel.

Fortran

Most Fortran compilers support the use of OpenMP. Available compilers include gfortran (version 4.2 or above), ifort, and pgf77/90.

Note that OpenMP is available for all flavours of Fortran (e.g. Fortran 77, Fortran 90, Fortran 2008), and is used in the same way in each of these flavours. In this course, all of the examples presented will be standard Fortran 77.

The first step is to create a simple OpenMP Fortran program, which we will call hello_openmp. Open a text editor (e.g. pico), create a file called hello_openmp.f and copy in the following code;

      program main
      implicit none

C$OMP PARALLEL
      print *,"Hello OpenMP!"
C$OMP END PARALLEL

      end

The only new lines in this example is C$OMP PARALLEL and C$OMP END PARALLEL, which is used to specify that all of the code between these two lines is part of an OpenMP parallel section.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gfortran : gfortran -fopenmp hello_openmp.f -o hello_openmp
  • ifort : ifort -openmp hello_openmp.f -o hello_openmp -cxxlib-icc
  • pgf77 : pgf77 -mp hello_openmp.f -o hello_openmp

This will produce the executable, hello_openmp.

Now return to the previous page to see how to run hello_openmp in parallel.

Directives (Pragmas)

So what was going on in the last example?

A standard program works by executing one line of code at a time, starting from the main function (or "program" function in Fortran) and working down line by line. This single thread of execution is known as the "main" thread. All programs have a single main thread, and in most programs, this is the only thread of execution, hence why the program can only do one thing at a time.

The hello_openmp also has a single main thread of execution. However, this main thread is split into a team of threads within the OpenMP parallel section. Each parallel thread in the team executes all of the code in the parallel section, hence each thread executes the line of code that prints Hello OpenMP!.

We can see this more clearly by getting each thread to identify itself. Please copy the code from one of these examples to create the executable hello_threads.

Try running this executable using different values of OMP_NUM_THREADS.

The OpenMP parallel section is specified by using compiler directives. These directives (also called compiler pragmas) are instructions to the compiler to tell it how to create the team of threads, and to help tell the compiler how to assign threads to tasks. These OpenMP directives are only followed if the program is compiled with OpenMP support. If the program is compiled without OpenMP support, then they are ignored.

There are several OpenMP directives. This course will cover the basic usage of just a selection;

  • parallel : Used to create a parallel block of code which will be executed by a team of threads
  • sections : Used to specify different sections of the code that can be run in parallel by different threads.
  • for (C/C++), do (Fortran) : Used to specify loops where different iterations of the loop are performed by different threads.
  • critical : Used to specify a block of code that can only be run by one thread at a time.
  • reduction : Used to combine (reduce) the results of several local calculations into a single result

Pragmas are added to the code using either;

#pragma omp name_of_directive

in C or C++, or using;

C$OMP name_of_directive

at the start of the line in fixed-format Fortran, or using;

!$OMP name_of_directive

anywhere in the line using free-format Fortran.


Next - Sections



C

Copy this into the file hello_threads.c

#include <stdio.h>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_num_threads() 0
    #define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int nthreads, thread_id;

    printf("I am the main thread.\n");

    #pragma omp parallel private(nthreads, thread_id)
    {
        nthreads = omp_get_num_threads();
        thread_id = omp_get_thread_num();

        printf("Hello. I am thread %d out of a team of %d\n", 
                 thread_id, nthreads);
    }

    printf("Here I am, back to the main thread.\n");

    return 0;
}

This example uses two OpenMP functions;

  • omp_get_num_threads() : Returns the number of threads in the OpenMP thread team.
  • omp_get_thread_num() : Returns the identifying number of the thread in the team.

Note that using these functions requires you to include the omp.h header file. To ensure portability (if OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and add stubs for the two OpenMP functions to set them to 0.

This example uses a slightly modified omp parallel line. In this case, private(nthreads, thread_id) is added to specify that each thread should have its own copy of the nthreads and thread_id variables.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gcc : gcc -fopenmp hello_threads.c -o hello_threads
  • icc : icc -openmp hello_threads.c -o hello_threads -cxxlib-icc
  • pgc : pgc -mp hello_threads.c -o hello_threads

This will produce the executable, hello_threads.

Now return to the previous page.

C++

Copy this into the file hello_threads.cpp

#include <iostream>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_num_threads() 0
    #define omp_get_thread_num() 0
#endif

int main(int argc, const char **argv)
{
    std::cout << "I am the main thread.\n";

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int thread_id = omp_get_thread_num();

        std::cout << "Hello. I am thread " << thread_id
                  << " out of a team of " << nthreads 
                  << std::endl;
    }

    std::cout << "Here I am, back to the main thread.\n";

    return 0;
}

This example uses two OpenMP functions;

  • omp_get_num_threads() : Returns the number of threads in the OpenMP thread team.
  • omp_get_thread_num() : Returns the identifying number of the thread in the team.

Note that using these functions requires you to include the omp.h header file. To ensure portability (if OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and add stubs for the two OpenMP functions to set them to 0.

Note also that because the nthreads and thread_id variables exist only in the scope of the omp parallel section, each thread will have its own private copy of these variables.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • g++ : g++ -fopenmp hello_threads.cpp -o hello_threads
  • icpc : icpc -openmp hello_threads.cpp -o hello_threads -cxxlib-icc
  • pgCC : pgCC -mp hello_threads.cpp -o hello_threads

This will produce the executable, hello_threads.

Now return to the previous page.

Fortran

Copy this into the file hello_threads.F (make sure you use a capital .F)

      program main

#ifdef _OPENMP
      use omp_lib
#endif

      implicit none

      integer thread_id
      integer num_threads

      print *,"I am the main thread."

C$OMP PARALLEL PRIVATE(thread_id, num_threads)

#ifdef _OPENMP
      thread_id = omp_get_thread_num()
      num_threads = omp_get_num_threads()
#else
      thread_id = 0
      num_threads = 1
#endif

      print *,"Hello. I am thread ",thread_id," out of a team of ",
     .        num_threads
C$OMP END PARALLEL

      print *,"Here I am, back to the main thread."

      end

This example uses two OpenMP functions;

  • omp_get_num_threads() : Returns the number of threads in the OpenMP thread team.
  • omp_get_thread_num() : Returns the identifying number of the thread in the team.

Note that using these functions requires you to use the omp_lib Fortran library. To ensure portability (if OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and manually set the values of num_threads and thread_id if _OPENMP is not set.

This example uses a slightly modified omp parallel line. In this case, private(nthreads, thread_id) is added to specify that each thread should have its own copy of the nthreads and thread_id variables.
You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gfortran : gfortran -fopenmp hello_threads.F -o hello_threads
  • ifort : ifort -openmp hello_threads.F -o hello_threads -cxxlib-icc
  • pgf77 : pgf77 -mp hello_threads.F -o hello_threads

This will produce the executable, hello_threads.

Now return to the previous page.

Sections

The OpenMP sections directive provides a means by which different threads can run different parts of the program in parallel. Copy the code from the appropriate link below to create the executable omp_sections.

Try running the executable omp_sections using different values of OMP_NUM_THREADS.

In this example, there are three functions, times_table, countdown and long_loop. These three functions are called from within an OpenMP sections directive, where each function is placed into a separate OpenMP section. This tells the compiler that these three functions can be called in parallel, and a thread from the team can be assigned to each section. Note that if there are more sections than threads, then each section is queued until a thread is available to run it, while if there are more threads than sections, then the extra threads will have nothing to do. Note that there is no guarantee as to the order in which sections are executed.


Next - Loops



C

Create the file omp_sections.c and copy in the following;

#include <stdio.h>
#include <unistd.h>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

void times_table(int n)
{
    int i, i_times_n, thread_id;
    
    thread_id = omp_get_thread_num();

    for (i=1; i<=n; ++i)
    {
        i_times_n = i * n;
        printf("Thread %d says %d times %d equals %d.\n",
               thread_id, i, n, i_times_n );

        sleep(1);
    }
}

void countdown()
{
    int i, thread_id;

    thread_id = omp_get_thread_num();

    for (i=10; i>=1; --i)
    {
        printf("Thread %d says %d...\n", thread_id, i);
        sleep(1);
    }

    printf("Thread %d says \"Lift off!\"\n", thread_id);
}

void long_loop()
{
    int i, thread_id;
    double sum = 0;

    thread_id = omp_get_thread_num();

    for (i=1; i<=10; ++i)
    {
        sum += (i*i);
        sleep(1);
    }

    printf("Thread %d says the sum of the long loop is %f\n",
            thread_id, sum);
}

int main(int argc, char **argv)
{
    printf("This is the main thread.\n");

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                times_table(12);
            }
            #pragma omp section
            {
                countdown();
            }
            #pragma omp section
            {
                long_loop();
            }
        }
    }

    printf("Back to the main thread. Goodbye!\n");

    return 0;
}

Note that we have included the header file unistd.h to use the function sleep, to add one second pauses (via sleep(1)) within each function.

In this example, the omp sections specifies a block of sections that may be run in parallel, with each individual section specified within each omp section block. While it is possible to write the code within each omp section block directly, the code is more readable if you write each section as a function (e.g. countdown, long_loop and times_table) and just call the function from within each section.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gcc : gcc -fopenmp sections.c -o sections
  • icc : icc -openmp sections.c -o sections -cxxlib-icc
  • pgc : pgc -mp sections.c -o sections

This will produce the executable, sections.

Now return to the previous page.

C++

Create the file omp_sections.cpp and copy in the following;

#include <iostream>
#include <unistd.h>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

void times_table(int n)
{
    int thread_id = omp_get_thread_num();

    for (int i=1; i<=n; ++i)
    {
        int i_times_n = i * n;
        std::cout << "Thread " << thread_id << " says " << i
                  << " times " << n << " equals " << i_times_n << std::endl;
        sleep(1);
    }
}

void countdown()
{
    int thread_id = omp_get_thread_num();

    for (int i=10; i>=1; --i)
    {
        std::cout << "Thread " << thread_id << " says " << i << "...\n";
        sleep(1);
    }

    std::cout << "Thread " << thread_id << " says \"Lift off!\"\n";
}

void long_loop()
{
    double sum = 0;

    int thread_id = omp_get_thread_num();

    for (int i=1; i<=10; ++i)
    {
        sum += (i*i);
        sleep(1);
    }

    std::cout << "Thread " << thread_id << " says the sum of the long loop is "
              << sum << std::endl;
}

int main(int argc, const char **argv)
{
    std::cout << "This is the main thread.\n";

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                times_table(12);
            }
            #pragma omp section
            {
                countdown();
            }
            #pragma omp section
            {
                long_loop();
            }
        }
    }

    std::cout << "Back to the main thread. Goodbye!\n";

    return 0;
}

Note that we have included the header file unistd.h to use the function sleep, to add one second pauses (via sleep(1)) within each function.

In this example, the omp sections specifies a block of sections that may be run in parallel, with each individual section specified within each omp section block. While it is possible to write the code within each omp section block directly, the code is more readable if you write each section as a function (e.g. countdown, long_loop and times_table) and just call the function from within each section.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • g++ : g++ -fopenmp omp_sections.cpp -o omp_sections
  • icpc : icpc -openmp omp_sections.cpp -o omp_sections -cxxlib-icc
  • pgCC : pgCC -mp omp_sections.cpp -o omp_sections

This will produce the executable, omp_sections.

Now return to the previous page.

Fortran

Create the file omp_sections.F and copy in the following;

      subroutine times_table(n)
      use omp_lib
      implicit none
 
      integer i, n, i_times_n
      integer thread_id

      thread_id = omp_get_thread_num()

      do 100 i=1,n
          i_times_n = i * n

          print *,"Thread ",thread_id," says ",
     .            i," times ",n," equals ",i_times_n

          call sleep(1)
100   continue

      end

      
      subroutine countdown
      use omp_lib
      implicit none

      integer i, thread_id

      thread_id = omp_get_thread_num()

      do 200 i=10,1,-1
          print *,"Thread ",thread_id," says ",i,"..."

          call sleep(1)
200   continue

      print *,"Thread ",thread_id," says Blast off!"

      end


      subroutine long_loop
      use omp_lib
      implicit none

      integer i, thread_id
      double precision sum

      thread_id = omp_get_thread_num()

      sum = 0

      do 300 i=1,10
          sum = sum + (i*i)
          call sleep(1)
300   continue

      print *,"Thread ",thread_id," says the sum of the long ",
     .        "loop equals ",sum

      end


      program main
      implicit none

C$OMP PARALLEL
C$OMP SECTIONS

C$OMP SECTION
          call times_table(12)

C$OMP SECTION
          call countdown()

C$OMP SECTION
          call long_loop()

C$OMP END SECTIONS
C$OMP END PARALLEL

      end

Note that this example calls the sleep subroutine to add one second pauses (via call sleep(1)) within each function.

In this example, the code from OMP SECTIONS to OMP END SECTIONS specifies a block of sections that may be run in parallel, with each individual section specified using OMP SECTION. While it is possible to write the code within each OMP SECTION block directly, the code is more readable if you write each section as a function (e.g. countdown, long_loop and times_table) and just call the function from within each section.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gfortran : gfortran -fopenmp omp_sections.F -o omp_sections
  • ifort : ifort -openmp omp_sections.F -o omp_sections -cxxlib-icc
  • pgf77 : pgf77 -mp omp_sections.F -o omp_sections

This will produce the executable, sections.

Now return to the previous page.

Loops

OpenMP sections provide a method by which you can assign different functions to be run by different threads. While this is easy to do, it does not scale well. You can only run as many threads in parallel as there are different functions to be run. If you have more threads than functions, then the extra threads will be idle. Also, if different functions take different amounts of time, then some threads may finish earlier than other threads, and they will be left idle waiting for the remaining threads to finish.

One way of achieving better performance is to use OpenMP to parallelise loops within your code. Lets imagine you have a loop that requires 1000 iterations. If you have two threads in the OpenMP team, then it would make sense for one thread to perform 500 of the 1000 iterations while the other thread performs the other 500 of 1000 iterations. This will scale as as more threads are added, the iterations of the loop can be shared evenly between them, e.g.

  • 2 threads : 500 iterations each
  • 4 threads : 250 iterations each
  • 100 threads : 10 iterations each
  • 1000 threads : 1 iteration each

Of course, this only scales up to the number of iterations in the loop, e.g. if there are 1500 threads, then 1000 threads will have 1 iteration each, while 500 threads will sit idle.

Also, and this is quite important, this will only work if each iteration of the loop is independent. This means that it should be possible to run each iteration in the loop in any order, and that each iteration does not affect any other iteration. This is necessary as running loop iterations in parallel means that we cannot guarantee that loop iteration 99 will be performed before loop iteration 100.

To see how to run an OpenMP parallel loop, copy out the appropriate code from the links below to create the executable loops;

Again, run this executable with different values of OMP_NUM_THREADS to get an idea of how it performs.


Next - Critical Sections



C

Copy this into loops.c

#include <stdio.h>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int i, thread_id, nloops;

    #pragma omp parallel private(thread_id, nloops)
    {
        nloops = 0;

        #pragma omp for
        for (i=0; i<1000; ++i)
        {
            ++nloops;
        }

        thread_id = omp_get_thread_num();

        printf("Thread %d performed %d iterations of the loop.\n",
                thread_id, nloops );
    }

    return 0;
}

The new directive in this code is omp for. This tells the compiler that the for loop directly below this pragma can be run in parallel by the team of threads. In this case, the only work performed in each iteration is increasing the thread private counter of the number of times that the loop has been performed (each thread in the team has its own copy of nloops because it is specified as private as part of the OpenMP parallel pragma). By doing this, each thread in the team counts up the number of times that it has performed the loop.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gcc : gcc -fopenmp loops.c -o loops
  • icc : icc -openmp loops.c -o loops -cxxlib-icc
  • pgc : pgc -mp loops.c -o loops

This will produce the executable, loops.

Now return to the previous page.

C++

Copy this into loops.cpp

#include <iostream>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

int main(int argc, const char **argv)
{
    #pragma omp parallel
    {
        int nloops = 0;

        #pragma omp for
        for (int i=0; i<1000; ++i)
        {
            ++nloops;
        }

        int thread_id = omp_get_thread_num();

        std::cout << "Thread " << thread_id << " performed "
                  << nloops << " iterations of the loop.\n";
    }

    return 0;
}

The new directive in this code is omp for. This tells the compiler that the for loop directly below this pragma can be run in parallel by the team of threads. In this case, the only work performed in each iteration is increasing the thread private counter of the number of times that the loop has been performed (each thread in the team has its own copy of nloops because this variable was declared only within the scope of the omp parallel block). By doing this, each thread in the team counts up the number of times that it has performed the loop.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • g++ : g++ -fopenmp loops.cpp -o loops
  • icpc : icpc -openmp loops.cpp -o loops -cxxlib-icc
  • pgCC : pgCC -mp loops.cpp -o loops

This will produce the executable, loops.

Now return to the previous page.

Fortran

Copy this into loops.F

      program main
      use omp_lib
      implicit none

      integer i, thread_id, nloops

C$OMP PARALLEL PRIVATE(thread_id, nloops)

      thread_id = omp_get_thread_num()
      nloops = 0

C$OMP DO
      do 100 i=1,1000
          nloops = nloops + 1
100   continue

      print *,"Thread ",thread_id," performed ",
     .        nloops," iterations."

C$OMP END PARALLEL

      end

The new directive in this code is $OMP DO. This tells the compiler that the do loop directly below this pragma can be run in parallel by the team of threads. In this case, the only work performed in each iteration is increasing the thread private counter of the number of times that the loop has been performed (each thread in the team has its own copy of nloops because it is specified as PRIVATE as part of the !OMP PARALLEL directive). By doing this, each thread in the team counts up the number of times that it has performed the loop.

You can compile this program using one of these commands (choose one for the compiler you wish to use);

  • gfortran : gfortran -fopenmp loops.f -o loops
  • ifort : ifort -openmp loops.f -o loops -cxxlib-icc
  • pgf77 : pgf77 -mp loops.f -o loops

This will produce the executable, loops.

Now return to the previous page.

Critical Code

Up to this point, the examples have shown how to just run different parts of the program in parallel, either by using sections, and running different sections using different threads, or using loops, and running different iterations of the loop using different threads. While this is good, it isn't yet useful. This is because we have not yet seen how to make the threads work together on a problem. At the moment, each thread works on its own, using its own private variables. For example, we've seen how each thread can count the number of iterations it performed itself in a loop, but no mechanism has been presented yet that would allow the team of threads to count up the total number of iterations performed by all of them.

What we need is a way to allow the threads to combine their thread private copies of variables into a global copy. One method of doing this is to use an OpenMP critical section. A critical section is a part of code that is performed by all of the threads in the team, but that is only performed by one thread at a time. This allows each thread to update a global variable with the local result calculated by that thread, without having to worry about another thread trying to update that global variable at the same time. To make this clear, use the appropriate link below to create the two executables, broken_loopcount and fixed_loopcount;

Try running both executables using different values of OMP_NUM_THREADS. Depending on the compiler, programming language and number of threads, broken_loopcount may print out a wide range of different outputs. Sometimes it will work, and will correctly add the number of iterations onto the global sum, and will correctly print out the intermediate steps without problem. However, sometimes, completely randomly, it will break, and either it will print out nonsense (e.g. it will add 1000 iterations to a total of 4000, but the insist that the total is 10000) or it will get the total number of iterations completely wrong. The reason for this is that while one thread is updating or printing the global total, another thread may be changing it.

The fixed_loopcount in contrast will always work, regardless of compiler, programming language or number of threads. This is because we've protected the update and printing of the global total within an OpenMP critical section. The critical section ensures that only one thread at a time is printing and updating the global sum of loops, and so ensures that two threads don't try to access global_nloops simultaneously.

OpenMP critical sections are extremely important if you want to guarantee that your program will work reproducibly. Without critical sections, random bugs can sneak through, and the result of your program may be different if different numbers of threads are used.

OpenMP loops plus OpenMP critical sections provide the basis for one of the most efficient models of parallel programming, namely map and reduce. The idea is that you map the problem to be solved into a loop over a large number of iterations. Each iteration solves its own part of the problem and computes the result into a local thread-private variable. Once the iteration is complete, all of the thread-private variables are combined together (reduced) via critical sections to form the final global result.


Exercise
Write an OpenMP parallel program to calculate pi using a Monte Carlo algorithm.

Pi can be calculated using Monte Carlo by imagining a circle with radius 1 sitting at the origin within a square that just contains this circle (so with corners [-1,-1], [-1,1], [1,-1] and [1,1]). The area of the circle is pi, (from pi r squared), while the area of the square is 4. If we imagine throwing darts randomly at the square, than the proportion that lie within the circle compared to the proportion that lie outside the circle will be directly related to the ratio of the area of the circle against the area of the square. In a parallel loop, you must thus generate a large number of random points in the square, and count up the number that lie within the circle and those that lie outside. Reduce these numbers into a global count of the number inside and outside the circle, and then finally take the ratio of these numbers to get the value of pi. This algorithm to calculate pi is described in more detail here.

Note that you will need to use the following random number functions and square root functions;

  • C : rand and sqrt, contained in math.h and stdlib.h
  • C++ : std::rand and std::sqrt, contained in cmath and cstdlib
  • Fortran : rand and sqrt

The Fortran rand function already generates a random number between 0 and 1. To achieve this in C or C++ you need to write the following function;

double rand_one()
{
    return rand() / (RAND_MAX + 1.0);
}

To get a random number from -1 to 1 you need to use;

(2 * rand()) - 1 or (2 * rand_one()) - 1

Here are the possible answers - take a look if you get stuck or you want to check your work;


Next - Reduction



C

Copy this code into broken_loopcount.c

#include <stdio.h>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int i, thread_id; 
    int global_nloops, private_nloops;

    global_nloops = 0;

    #pragma omp parallel private(private_nloops, thread_id)
    {
        private_nloops = 0;

        thread_id = omp_get_thread_num();

        #pragma omp for
        for (i=0; i<100000; ++i)
        {
            ++private_nloops;
        }

        printf("Thread %d adding its iterations (%d) to the sum (%d)...\n",
               thread_id, private_nloops, global_nloops);

        global_nloops += private_nloops;

        printf("...total nloops now equals %d.\n", global_nloops);
    }

    printf("The total number of loop iterations is %d\n",
           global_nloops);

    return 0;
}

and copy this code into fixed_loopcount.c

#include <stdio.h>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int i, thread_id; 
    int global_nloops, private_nloops;

    global_nloops = 0;

    #pragma omp parallel private(private_nloops, thread_id)
    {
        private_nloops = 0;

        thread_id = omp_get_thread_num();

        #pragma omp for
        for (i=0; i<100000; ++i)
        {
            ++private_nloops;
        }

        #pragma omp critical
        {
            printf("Thread %d adding its iterations (%d) to the sum (%d)...\n",
                   thread_id, private_nloops, global_nloops);

            global_nloops += private_nloops;

            printf("...total nloops now equals %d.\n", global_nloops);
        }
    }

    printf("The total number of loop iterations is %d\n",
           global_nloops);

    return 0;
}

The only new code here is the omp critical section in fixed_loopcount.c. The critical section is performed by each thread, but can only be performed by one thread at a time. This ensures that while one thread is updating the global nloops variable (global_nloops) with the thread local value of nloops (private_nloops), that the value of global_nloops is not changed by any other thread.

You can compile these programs using one of these sets of commands (choose one for the compiler you wish to use);

  • gcc : gcc -fopenmp broken_loopcount.c -o broken_loopcount ; gcc -fopenmp fixed_loopcount.c -o fixed_loopcount
  • icc : icc -openmp broken_loopcount.c -o broken_loopcount -cxxlib-icc ; icc -openmp fixed_loopcount.c -o fixed_loopcount -cxxlib-icc
  • pgc : pgc -mp broken_loopcount.c -o broken_loopcount ; pgc -mp fixed_loopcount.c -o fixed_loopcount

This will produce the executables, broken_loopcount and fixed_loopcount.

Now return to the previous page.

C++

Copy this code into broken_loopcount.cpp

#include <iostream>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int global_nloops = 0;

    #pragma omp parallel
    {
        int private_nloops = 0;

        int thread_id = omp_get_thread_num();

        #pragma omp for
        for (int i=0; i<100000; ++i)
        {
            ++private_nloops;
        }

        std::cout << "Thread " << thread_id << " adding its iterations ("
                  << private_nloops << ") to the sum (" << global_nloops 
                  << ")...\n";

        global_nloops += private_nloops;

        std::cout << "...total nloops now equals " << global_nloops << ".\n";
    }

    std::cout << "The total number of loop iterations is " << global_nloops 
              << ".\n";

    return 0;
}

and copy this code into fixed_loopcount.cpp

#include <iostream>

#ifdef _OPENMP
    #include <omp.h>
#else
    #define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int global_nloops = 0;

    #pragma omp parallel
    {
        int private_nloops = 0;

        int thread_id = omp_get_thread_num();

        #pragma omp for
        for (int i=0; i<100000; ++i)
        {
            ++private_nloops;
        }

        #pragma omp critical
        {
            std::cout << "Thread " << thread_id << " adding its iterations ("
                      << private_nloops << ") to the sum (" << global_nloops 
                      << ")...\n";

            global_nloops += private_nloops;

            std::cout << "...total nloops now equals " << global_nloops 
                      << ".\n";
        }
    }

    std::cout << "The total number of loop iterations is " << global_nloops 
              << ".\n";

    return 0;
}

The only new code here is the omp critical section in fixed_loopcount.cpp. The critical section is performed by each thread, but can only be performed by one thread at a time. This ensures that while one thread is updating the global nloops variable (global_nloops) with the thread local value of nloops (private_nloops), that the value of global_nloops is not changed by any other thread.

You can compile these programs using one of these sets of commands (choose one for the compiler you wish to use);

  • g++ : g++ -fopenmp broken_loopcount.cpp -o broken_loopcount ; g++ -fopenmp fixed_loopcount.cpp -o fixed_loopcount
  • icpc : icpc -openmp broken_loopcount.cpp -o broken_loopcount -cxxlib-icc ; icpc -openmp fixed_loopcount.cpp -o fixed_loopcount -cxxlib-icc
  • pgCC : pgCC -mp broken_loopcount.cpp -o broken_loopcount ; pgCC -mp fixed_loopcount.cpp -o fixed_loopcount

This will produce the executables, broken_loopcount and fixed_loopcount.

Now return to the previous page.

Fortran

Copy this code into broken_loopcount.F

      program main
      use omp_lib
      implicit none

      integer i, thread_id
      integer private_nloops, global_nloops

      global_nloops = 0

C$OMP PARALLEL PRIVATE(thread_id, private_nloops)
      thread_id = omp_get_thread_num()
      private_nloops = 0

C$OMP DO
      do 100 i=1,100000
          private_nloops = private_nloops + 1
100   continue

      print *,"Thread ",thread_id," adding its iterations (",
     .        private_nloops,") to the sum (",global_nloops,
     .        ")..."

      global_nloops = global_nloops + private_nloops

      print *,"...total nloops now equals ",global_nloops

C$OMP END PARALLEL

      print *,"The total number of loop iterations is ",
     .        global_nloops

      end

and copy this code into fixed_loopcount.F

      program main
      use omp_lib
      implicit none

      integer i, thread_id
      integer private_nloops, global_nloops

      global_nloops = 0

C$OMP PARALLEL PRIVATE(thread_id, private_nloops)
      thread_id = omp_get_thread_num()
      private_nloops = 0

C$OMP DO
      do 100 i=1,100000
          private_nloops = private_nloops + 1
100   continue

C$OMP CRITICAL
      print *,"Thread ",thread_id," adding its iterations (",
     .        private_nloops,") to the sum (",global_nloops,
     .        ")..."

      global_nloops = global_nloops + private_nloops

      print *,"...total nloops now equals ",global_nloops
C$OMP END CRITICAL

C$OMP END PARALLEL

      print *,"The total number of loop iterations is ",
     .        global_nloops

      end

The only new code here is the $OMP CRITICAL to $OMP END CRITICAL section in fixed_loopcount.c. The critical section is performed by each thread, but can only be performed by one thread at a time. This ensures that while one thread is updating the global nloops variable (global_nloops) with the thread local value of nloops (private_nloops), that the value of global_nloops is not changed by any other thread.

You can compile these programs using one of these sets of commands (choose one for the compiler you wish to use);

  • gfortran : gfortran -fopenmp broken_loopcount.F -o broken_loopcount ; gfortran -fopenmp fixed_loopcount.F -o fixed_loopcount
  • ifort : ifort -openmp broken_loopcount.F -o broken_loopcount -cxxlib-icc ; ifort -openmp fixed_loopcount.F -o fixed_loopcount -cxxlib-icc
  • pgf77 : pgf77 -mp broken_loopcount.F -o broken_loopcount ; pgf77 -mp fixed_loopcount.F -o fixed_loopcount

This will produce the executables, broken_loopcount and fixed_loopcount.

Now return to the previous page.

Answer to Exercise (C)

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

double rand_one()
{
    return rand() / (RAND_MAX + 1.0);
}

int main(int argc, char **argv)
{
    int n_inside, n_outside;
    int pvt_n_inside, pvt_n_outside;
    int i;

    double x, y, r, pi;

    n_inside = 0;
    n_outside = 0;

    #pragma omp parallel private(x, y, r, pvt_n_inside, pvt_n_outside)
    {
        pvt_n_inside = 0;
        pvt_n_outside = 0;

        #pragma omp for
        for (i=0; i<1000000; ++i)
        {
            x = (2*rand_one()) - 1;
            y = (2*rand_one()) - 1;

            r = sqrt( x*x + y*y );

            if (r < 1.0)
            {
                ++pvt_n_inside;
            }
            else
            {
                ++pvt_n_outside;
            }
        }

        #pragma omp critical
        {
            n_inside += pvt_n_inside;
            n_outside += pvt_n_outside;
        }
    }

    pi = (4.0 * n_inside) / (n_inside + n_outside);

    printf("The estimated value of pi is %f\n", pi);

    return 0;
}

Answer to Exercise (C++)

#include <cmath>
#include <cstdlib>
#include <iostream>

double rand_one()
{
    return std::rand() / (RAND_MAX + 1.0);
}

int main(int argc, const char **argv)
{
    int n_inside = 0;
    int n_outside = 0;

    #pragma omp parallel
    {
        int pvt_n_inside = 0;
        int pvt_n_outside = 0;

        #pragma omp for
        for (int i=0; i<1000000; ++i)
        {
            double x = (2*rand_one()) - 1;
            double y = (2*rand_one()) - 1;

            double r = sqrt( x*x + y*y );

            if (r < 1.0)
            {
                ++pvt_n_inside;
            }
            else
            {
                ++pvt_n_outside;
            }
        }

        #pragma omp critical
        {
            n_inside += pvt_n_inside;
            n_outside += pvt_n_outside;
        }
    }

    double pi = (4.0 * n_inside) / (n_inside + n_outside);

    std::cout << "The estimated value of pi is " << pi << std::endl;

    return 0;
}

Answer to Exercise (Fortran)

      program main
      implicit none

      integer n_inside, n_outside
      integer pvt_n_inside, pvt_n_outside

      integer i

      double precision x,y,r,pi

      n_inside = 0
      n_outside = 0

C$OMP PARALLEL
      pvt_n_inside = 0
      pvt_n_outside = 0

C$OMP DO
      do 100 i=1,1000000
           x = (2 * rand()) - 1
           y = (2 * rand()) - 1

           r = sqrt(x**2 + y**2)

           if (r < 1) then
               pvt_n_inside = pvt_n_inside + 1
           else
               pvt_n_outside = pvt_n_outside + 1
           endif
100   continue

C$OMP CRITICAL
      n_inside = n_inside + pvt_n_inside
      n_outside = n_outside + pvt_n_outside
C$OMP END CRITICAL

C$OMP END PARALLEL

      pi = 4 * dble(n_inside) / dble(n_inside + n_outside)

      print *,"The estimated value of pi is ",pi

      end

Reduction

Reduction, which is the process of combining (or reducing) the results of several sub-calculations into a single combined (reduced) result, is very common, and is the second half of the very powerful map-reduce form of parallel programming. In the exercise in the last section you used reduction to form the global sum of the total number of points inside and outside the circle, calculated as the sum of the number of points inside and outside the circle calculated by each thread's iterations of the loop. While it may appear easy to write your own reduction code, it is actually very hard to write efficient reduction code. This is because reduction requires the use of a critical section, where only one thread is allowed to update the global sum at a time. Reduction can actually be implemented much more efficiently, e.g. perhaps by dividing threads into pairs, and getting each pair to sum their results, and then dividing into pairs of pairs, and summing the pairs of pairs results, etc. (this method is known as binary tree reduction - see here for a more in-depth discussion of reduction algorithms).

So reduction is actually quite complicated to implement if you want it to be efficient and work well. Fortunately, you don't have to implement it, as OpenMP provides a reduction directive which has implemented it for you! The reduction directive is added to the end of the OpenMP parallel directive and has the following form;

reduction( operator : variable list )

where operator can be any binary operator (e.g. +, -, *), and variable list is a single variable or list of variables that will be used for the reduction, e.g.

reduction( + : sum )

will tell the compiler that sum will hold the global result of a reduction which will be formed by adding together the results of thread-private calculations, while,

reduction( - : balance, diff )

will tell the compiler that both balance and diff will hold the global results of reductions which are formed by subtracting the results of thread-private calculations.

To make this clear, the following links provide the code for the fixed loop counting examples from the last section which use reduction rather than thread-private variables with an OpenMP critical section;


Exercise
Edit your program to estimate pi so that it uses reduction to form the sum of the number of points inside and outside the circle.

Here's the answers for checking (no peeking before you've finished!)


Next - Map/Reduce



C

#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    int private_nloops, nloops;

    nloops = 0;

    #pragma omp parallel private(private_nloops) \
                         reduction(+ : nloops)
    {
        private_nloops = 0;

        #pragma omp for
        for (i=0; i<100000; ++i)
        {
            ++private_nloops;
        }

        /* Reduction step - reduce 'private_nloops' into 'nloops' */
        nloops = nloops + private_nloops;
    }

    printf("The total number of loop iterations is %d\n",
           nloops);

    return 0;
}

Return to the previous page.

C++

#include <iostream>

int main(int argc, char **argv)
{
    int nloops = 0;

    #pragma omp parallel reduction( + : nloops )
    {
        int private_nloops = 0;

        #pragma omp for
        for (int i=0; i<100000; ++i)
        {
            ++private_nloops;
        }

        nloops = nloops + private_nloops;
    }

    std::cout << "The total number of loop iterations is " << nloops << ".\n";

    return 0;
}

Return to the previous page.

Fortran

      program main
      implicit none

      integer i
      integer private_nloops, nloops

      nloops = 0

C$OMP PARALLEL PRIVATE(private_nloops)
C$OMP&         REDUCTION( + : nloops )
      private_nloops = 0

C$OMP DO
      do 100 i=1,100000
          private_nloops = private_nloops + 1
100   continue

      nloops = nloops + private_nloops

C$OMP END PARALLEL

      print *,"The total number of loop iterations is ",
     .        nloops

      end

Return to the previous page.

Answer to Exercise (C)

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

double rand_one()
{
    return rand() / (RAND_MAX + 1.0);
}

int main(int argc, char **argv)
{
    int n_inside, n_outside;
    int pvt_n_inside, pvt_n_outside;
    int i;

    double x, y, r, pi;

    n_inside = 0;
    n_outside = 0;

    #pragma omp parallel private(x, y, r, pvt_n_inside, pvt_n_outside) \
                         reduction( + : n_inside, n_outside )
    {
        pvt_n_inside = 0;
        pvt_n_outside = 0;

        #pragma omp for
        for (i=0; i<1000000; ++i)
        {
            x = (2*rand_one()) - 1;
            y = (2*rand_one()) - 1;

            r = sqrt( x*x + y*y );

            if (r < 1.0)
            {
                ++pvt_n_inside;
            }
            else
            {
                ++pvt_n_outside;
            }
        }

        n_inside = n_inside + pvt_n_inside;
        n_outside = n_outside + pvt_n_outside;
    }

    pi = (4.0 * n_inside) / (n_inside + n_outside);

    printf("The estimated value of pi is %f\n", pi);

    return 0;
}

Answer to Exercise (C++)

#include <cmath>
#include <cstdlib>
#include <iostream>

double rand_one()
{
    return std::rand() / (RAND_MAX + 1.0);
}

int main(int argc, const char **argv)
{
    int n_inside = 0;
    int n_outside = 0;

    #pragma omp parallel reduction(+ : n_inside, n_outside)
    {
        int pvt_n_inside = 0;
        int pvt_n_outside = 0;

        #pragma omp for
        for (int i=0; i<1000000; ++i)
        {
            double x = (2*rand_one()) - 1;
            double y = (2*rand_one()) - 1;

            double r = sqrt( x*x + y*y );

            if (r < 1.0)
            {
                ++pvt_n_inside;
            }
            else
            {
                ++pvt_n_outside;
            }
        }

        n_inside += pvt_n_inside;
        n_outside += pvt_n_outside;
    }

    double pi = (4.0 * n_inside) / (n_inside + n_outside);

    std::cout << "The estimated value of pi is " << pi << std::endl;

    return 0;
}

Answer to Exercise (Fortran)

      program main
      implicit none

      integer n_inside, n_outside
      integer pvt_n_inside, pvt_n_outside

      integer i

      double precision x,y,r,pi

      n_inside = 0
      n_outside = 0

C$OMP PARALLEL REDUCTION( + : n_inside, n_outside )
      pvt_n_inside = 0
      pvt_n_outside = 0

C$OMP DO
      do 100 i=1,1000000
           x = (2 * rand()) - 1
           y = (2 * rand()) - 1

           r = sqrt(x**2 + y**2)

           if (r < 1) then
               pvt_n_inside = pvt_n_inside + 1
           else
               pvt_n_outside = pvt_n_outside + 1
           endif
100   continue

      n_inside = n_inside + pvt_n_inside
      n_outside = n_outside + pvt_n_outside

C$OMP END PARALLEL

      pi = 4 * dble(n_inside) / dble(n_inside + n_outside)

      print *,"The estimated value of pi is ",pi

      end

Map/Reduce

We have now covered enough that we can use OpenMP to parallelise a map/reduce style calculation. In this case, the problem we will solve will be calculating the total interaction energy between each ion in an array of ions with a single reference ion. Passed into this function will be the reference ion, and an array of ions. The algorithm performed for each ion in the array will be;

  1. Calculate the distance between the ion in the array and the reference ion.
  2. Use this distance (r) to calculate the interaction energy ( 1 / r )
  3. Add this interaction energy onto the total sum.

Map/reduce can be used when you have an array of data, a function you wish to apply (to map) to each item in the array, and a single value you want back that is the reduction of the results of applying the function to each item in the array. In terms of map/reduce, our algorithm would look like this;

  1. Create a function that calculates and returns the interaction between a passed ion and the reference ion, e.g. calc_energy(ion)
  2. Map each ion in the array against the energy function calc_energy
  3. Reduce the result of each mapped function call using a sum (+)

Here are incomplete pieces of code that implement this algorithm (note that this is to provide an example of how map/reduce can be used - you don't need to complete this code);

Note that the amount of OpenMP in these examples is very low (just 2-3 lines). This is quite common for OpenMP programs - most of the work of parallelisation is organising your code so that it can be parallelised. Once it has been organised, you then only need to add a small number of OpenMP directives.

Compile the above programs, and try running them using different numbers of threads.


Next - Maximising Performance



C

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

/* Define an Ion type to hold the 
   coordinates of an Ion */
typedef struct Ion
{
    double x;
    double y;
    double z;
} Ion;

/* The reference Ion */
struct Ion reference_ion;

/* Return the square of a number */
double square(double x)
{
    return x*x;
}

/* Energy function to be mapped */
double calc_energy( struct Ion ion )
{
    double r;

    r = sqrt( square( reference_ion.x - ion.x ) +
              square( reference_ion.y - ion.y ) +
              square( reference_ion.z - ion.z ) );

    /* The energy is simply 1 / r */
    return 1.0 / r;
}

/* You will need to fill in this function to read in 
   an array of ions and return the array */
struct Ion* readArrayOfIons(int *num_ions)
{
    int i;
    Ion *ions = (Ion*)malloc(10 * sizeof(Ion));

    *num_ions = 10;

    for (i=0; i<10; ++i)
    {
        ions[i].x = 0.0;
        ions[i].y = 0.0;
        ions[i].z = i;
    }

    return ions;
}

int main(int argc, char **argv)
{
    int i, num_ions;
    struct Ion *ions_array;
    double total_energy, mapped_energy;

    /* Lets put the reference ion at (1,2,3) */
    reference_ion.x = 1.0;
    reference_ion.y = 2.0;
    reference_ion.z = 3.0;

    /* Now read in an array of ions */
    ions_array = readArrayOfIons( &num_ions );

    total_energy = 0.0;

    #pragma omp parallel private(mapped_energy) \
                         reduction( + : total_energy )
    {
        mapped_energy = 0.0;

        #pragma omp for
        for (i=0; i < num_ions; ++i)
        {
            /* Map this ion against the function */
            mapped_energy += calc_energy( ions_array[i] );
        }

        /* Reduce to get the result */
        total_energy += mapped_energy;
    }
    
    printf("The total energy is %f\n", total_energy);

    return 0;
}

C++

#include <cmath>
#include <vector>
#include <iostream>

// Define an ion type to hold the
// coordinates of the ion
struct Ion
{
    Ion()
    {
        x = 0;
        y = 0;
        z = 0;
    }

    Ion(double _x, double _y, double _z)
    {
        x = _x;
        y = _y;
        z = _z;
    }

    double x;
    double y;
    double z;
};

// The reference ion
Ion reference_ion;

// Return the square of a number
inline double square(const double x)
{
    return x*x;
}

// Energy function to be mapped
double calc_energy(const Ion &ion)
{
    double r = std::sqrt( square( reference_ion.x - ion.x ) + 
                          square( reference_ion.y - ion.y ) +
                          square( reference_ion.z - ion.z ) );

    // The energy is simply 1 / r
    return 1.0 / r;
}

// You would need to fill in this function to read
// in an array of ions and return the array
std::vector<Ion> readArrayOfIons()
{
    std::vector<Ion> ions;

    for (int i=0; i<10; ++i)
    {
        ions.push_back( Ion(0,0,i) );
    }

    return ions;
}

int main(int argc, const char **argv)
{
    //put the reference ion at (1,2,3)
    reference_ion.x = 1.0;
    reference_ion.y = 2.0;
    reference_ion.z = 3.0;

    //read in the array of ions
    std::vector<Ion> ions_array = readArrayOfIons();

    double total_energy = 0.0;

    #pragma omp parallel reduction( + : total_energy )
    {
        double mapped_energy = 0.0;

        const int n_ions = ions_array.size();

        #pragma omp for
        for (int i=0; i < n_ions; ++i)
        {
            //map this ion against the function
            mapped_energy += calc_energy( ions_array[i] );
        }

        //reduce to get the result
        total_energy += mapped_energy;
    }

    std::cout << "The total energy is " << total_energy << std::endl;

    return 0;
}

Fortran

c     Here is the function used to calculate the energy
c     of the passed ion

      double precision function calc_energy(ref_x, ref_y, ref_z,
     .                                      ion_x, ion_y, ion_z)
      implicit none

      double precision ion_x, ion_y, ion_z
      double precision ref_x, ref_y, ref_z
      double precision r

      r = sqrt( (ref_x - ion_x)**2 + 
     .          (ref_y - ion_y)**2 +
     .          (ref_z - ion_z)**2 )

c     The energy is just 1 / r
      calc_energy = 1.0 / r

      return
      end


c     You would need to fill in this subroutine to read
c     in the array of ions
      subroutine readArrayOfIons(ions_array, n_ions)
      implicit none

      integer i, n_ions
      integer MAX_IONS 
      parameter(MAX_IONS = 1000)
      double precision ions_array(MAX_IONS,3)

      n_ions = 10

      do i=1,10
        ions_array(i,1) = 0.0
        ions_array(i,2) = 0.0
        ions_array(i,3) = i-1
      enddo

      end


      program main
      implicit none

      integer i, n_ions
      integer MAX_IONS
      parameter(MAX_IONS = 1000)

      double precision ions_array(MAX_IONS, 3)
      double precision total_energy, mapped_energy
      double precision calc_energy
      double precision ref_ion(3)

      ref_ion(1) = 1
      ref_ion(2) = 2
      ref_ion(3) = 3

      call readArrayOfIons(ions_array, n_ions)

      total_energy = 0

C$OMP PARALLEL PRIVATE(mapped_energy) REDUCTION( + : total_energy)
       
      mapped_energy = 0

C$OMP DO
      do 100 i=1,n_ions
C         map this ion against the function
          mapped_energy = mapped_energy + calc_energy(ref_ion(1),
     .                                                ref_ion(2),
     .                                                ref_ion(3),
     .                                                ions_array(i, 1),
     .                                                ions_array(i, 2),
     .                                                ions_array(i, 3))

100   continue

C     Reduce to get the result
      total_energy = total_energy + mapped_energy

C$OMP END PARALLEL

      print *,"The total energy is ",total_energy

      end

Performance

OpenMP is reasonably straightforward to use, and can be added to existing programs. However, don't let this simplicity deceive you. While writing OpenMP code is straightforward, writing efficient, scalable OpenMP code can be hard, and requires you to think deeply about the problem to be solved. As you saw in the map/reduce example, the amount of OpenMP was quite small, but the application itself had to be arranged in a way that allowed the problem to be solved using a map/reduce approach.

While the techniques used to get good performance using OpenMP are problem specific, there are a set of general guidelines that will put you on the right track;

  • Avoid working with global variables whenever possible (which in Fortran means avoid common blocks!).
  • Try to do as much work as you can using thread-private variables. Structure your program so that each thread can calculate a thread-local result, and then these thread-local results can then be combined together at the end into the final answer.
  • Avoid OpenMP critical regions wherever possible. These will cause threads to block as only one thread can be in a critical region at a time. However, remember that critical regions are necessary whenever you update a global variable.
  • Try to leave all global updates to the end of the parallel region.
  • Prefer to use OpenMP reduction operations rather than trying to write your own.

However, the most important advice is benchmark, benchmark, benchmark! Don't try and guess the performance of your code, or guess that something makes it faster. Actually time how fast your code runs (e.g. using the time command or using timers within your code), and compare timings for different numbers of threads (by changing OMP_NUM_THREADS) every time you make changes to your code. Make benchmarking and performance testing as important in your coding workflow as debugging and unit testing.


Next - Case Study



Case Study

The paper Multicore Parallelization of Kohn-Sham Theory presents an OpenMP parallelisation of a Density Functional Theory (DFT) calculation. The calculation is performed within the quantum chemical program Molpro. Molpro is a large (1.5 million lines of code) program written in a variety of versions of Fortran. Benchmarking revealed that the bottleneck of the calculation was the numerical integration of integrals using a quadrature grid. We thus decided to use OpenMP to parallelise this calculation by using an OpenMP loop to map batches of grid points for evaluation to different threads, and to then use OpenMP reduction to sum together the sum of the integral evaluations at each grid point.

To achieve this, we had to restructure the way that Molpro represented the quadrature grid and basis functions for the integrals. We decided that we would write the code in C++ as this would let us have greater control of which data was thread-local and which was global, and it allowed us to create objects that represented different bits of data in the calculation (e.g. Grid, Orbital, Matrix etc. - if you compare the Fortran, C and C++ examples in this workshop you will see that the OpenMP directives for C++ are much easier as the private directive is not needed). Rather than rewrite the whole of Molpro in C++ (which would have taken years!) we just rewrote the function that performed the numerical integration in C++, and then linked this new C++ function into the Fortran program. This works, as C++ and Fortran are both compiled to machine code, and there are now standard methods for calling Fortran functions from C++, and for calling C++ functions from Fortran.

Rewriting the code in this way allowed us to speed up the DFT calculation in Molpro by over 10 times on a 16 core Intel Xeon (a quad-socket quad-core node). The rewrite itself took a few months, the majority of which were spent thinking about how to restructure the algorithm so that it could be easily parallelised using OpenMP.



The above graphs show the scaling behaviour of the resulting code as a function of the number of OpenMP threads. Three calculations were performed on each of the four different benchmark platforms, with each calculation using a sequentially larger data size. This allowed us to benchmark how well the code scaled as we increased the number of cores, and also how the code behaved as the amount of data to process became larger than the bandwidth between main memory and the processor (the solid line is the smallest problem, the dashed line is the medium problem and the dotted line is the large problem). This kind of benchmarking is invaluable when writing parallel code, and is also useful for machine evaluation, e.g. in this case we can see that the Intel Xeon processors have more bandwidth so cope better than the Opterons as the problem size increases. It should be noted that the Opterons tested here were much older than the Xeons, and that modern Opterons are likely to be comparable to modern Xeons.


Next - What's Next?

What's next?

This course was designed to give you a quick taster of OpenMP. If you enjoyed it, and want to learn more, then I strongly encourage you to take a look at a book on OpenMP, e.g. Using OpenMP: Portable Shared Memory Parallel Programming, or to check out other OpenMP websites or performing a web search for openmp tutorial or openmp for beginners.

The best way to learn any programming language or tool is to read other people's code and copy it. Please feel free to copy, adapt and play with the examples in this workshop. They should hopefully provide the starting points for basic parallelisation of your code using OpenMP.



posted on 2014-01-01 12:38  向北方  阅读(2064)  评论(0编辑  收藏  举报