C++ => C# => F#, more functional, more parallel (3)
In this article, we will see how the functional programming style makes code parallelization much easier. First we introduce a parallel version of QuickSort, which is quite complex to implement in C++ but very easy in C#. Then we will see how the immutability of variables in functional languages helps parallelize code. Finally, we visit some useful features of the .NET Parallel Extensions (.NET PE), although they are less related to functional programming.
Every statement is a variable
Recall that OpenMP, the well-known parallel programming extension for C++, provides the "parallel for" directive. Using "#pragma omp parallel for", we can simply turn a sequential for loop into a parallel one. Extending a mature language with new directives rather than defining a whole new language is quite reasonable and practical: it reduces the learning cost and keeps existing code usable in the extended language.
However, this technique is not always effective, especially in imperative languages. For example, suppose we want to implement a parallel version of QuickSort based on existing sequential source code. In C++ with OpenMP, we might wrap the sequential QuickSort(int *array, int low, int high) and Partition(int *array, int low, int high) into a new parallel version like this:
void ParallelQuickSort(int *pnArray, int nLow, int nHigh)
{
    if (nLow >= nHigh)
        return;

    int nPivotIndex = Partition(pnArray, nLow, nHigh);
    int nNewLow[2], nNewHigh[2];
    nNewLow[0] = nLow;
    nNewLow[1] = nPivotIndex + 1;
    nNewHigh[0] = nPivotIndex - 1;
    nNewHigh[1] = nHigh;

    // Sort the two partitions in parallel, one loop iteration per thread
    #pragma omp parallel for
    for (int i = 0; i < 2; i++)
        QuickSort(pnArray, nNewLow[i], nNewHigh[i]);
}
This seems to solve the problem. The function is parallel and may give relatively good results on a dual-core machine. But what if the machine has a 4-way CPU? The code still runs in only 2 threads, so it cannot exploit the CPU's full potential. Of course we can manually modify the code to suit a 4-core CPU, but what if the CPU has 8 cores, or many cores in the future? This reveals the solution's lack of scalability, which is critical in the parallel world.
An improved solution is to replace the OpenMP directive with explicit thread-creation statements. This solves the previous problem: with more threads, each occupying a core, the potential of multi-core CPUs can be used. But each thread requires a named function to specify its work, so the simple QuickSort algorithm now needs 4 to 5 functions to describe. In addition, remember that thread creation and switching are both very expensive. There may be more than 1000 threads running simultaneously when sorting 1 million numbers on a dual-core CPU, and the CPU ends up busy switching among them rather than doing the actual work. So this solution performs badly in most cases and is not practical.
If we use a thread pool instead of raw threads to distribute the computation, the frequent thread switching can be avoided. Whenever a core becomes free, it "pulls" the next piece of work from the thread pool and processes it, as Figure 1 shows.
Figure 1. Thread pool
With this, the problem finally gets solved in C++ style, but the solution is quite complicated and requires a lot of background knowledge, such as thread pools. Worst of all, OpenMP does not seem to expose an explicit thread pool that developers can invoke without directives like "#pragma omp parallel for", so writing practical code is still hard work. Now let's take a look at how to implement parallel QuickSort with C# + .NET Parallel Extensions (.NET PE):
static void QuickSort(int[] items, int low, int high)
{
    if (low >= high)
        return;
    int q = Partition(items, low, high);
    // Each recursive call becomes a task scheduled on the thread pool
    Task.Create(x => QuickSort(items, low, q - 1));
    Task.Create(x => QuickSort(items, q + 1, high));
}
This is very similar to the sequential version. The only difference is that the original recursive call QuickSort(…) turns into Task.Create(x => QuickSort(…)), which means the call is wrapped in a task that is managed by the thread pool and runs in parallel automatically. No extra #pragma directives, no artificial for loops, and certainly no need to define a completely new language; all the benefit comes from the expressiveness of a language in which every statement can be treated as a variable holding a function. Because of this feature, we can easily specify that this statement should run in parallel and that one should not. We can also implement functions that take functions as parameters, which helps greatly in building a parallel library with a fluent interface for its users. This expressive power is what functional programming brings to us, and it forms the foundation of an easy-to-use parallel library.
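To make the idea of "functions taking functions as parameters" concrete, here is a minimal sketch of a helper that accepts arbitrary statements as delegates and runs them on the thread pool. It uses the Task.Factory.StartNew / Task.WaitAll API of the Task Parallel Library that later shipped with .NET 4 rather than the CTP-era Task.Create shown above, and the helper name ParallelHelper.DoAll is made up for illustration:

using System;
using System.Threading.Tasks;

static class ParallelHelper
{
    // Runs every action as a separate thread-pool task and waits for all of them.
    public static void DoAll(params Action[] actions)
    {
        var tasks = new Task[actions.Length];
        for (int i = 0; i < actions.Length; i++)
            tasks[i] = Task.Factory.StartNew(actions[i]);
        Task.WaitAll(tasks);
    }
}

// Usage: pass the two recursive calls as lambda "statements".
// ParallelHelper.DoAll(
//     () => QuickSort(items, low, q - 1),
//     () => QuickSort(items, q + 1, high));

Because the statements are just values, the helper neither knows nor cares what work they perform; that is exactly what makes such a library easy to use.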
We can also run this code snippet on an actual multi-core computer. On an 8-way Xeon server, sorting 10,000,000 numbers between 0 and 32767 takes 1187 ms with the sequential version but only 170 ms with the parallel version. The speedup is about 7.0 rather than 8.0 for several reasons. First, QuickSort is an O(n lg n) algorithm, so the theoretical speedup for 10,000,000 numbers is actually about 9.3 rather than a linear 8.0: when the work is split evenly, each core sorts only about n/8 of the elements, so the ideal ratio is (n lg n) / ((n/8) lg(n/8)), which is slightly larger than 8. Second, there is always overhead in managing the parallelism, so the theoretical value is rarely reached. Third, this simple parallelization scheme does not use all the available computation power; for example, the other 7 cores have to wait, doing nothing, until one core finishes partitioning all 10,000,000 numbers. This also keeps the measured performance below the ideal.
In conclusion, the functional programming style makes parallelizing existing code very simple and efficient. And the parallel code really works, helping us make full use of the computation potential of multi-core CPUs.
Immutable!
As mentioned in the first article of this series, variables in functional languages are immutable, i.e. they cannot be changed after they are created. This helps greatly when parallelizing existing code, because it avoids write conflicts and deadlocks. Languages with functional elements such as C# also take advantage of this feature: the string class and LINQ are typical examples. For objects with this property, writing a parallel version is extremely easy.
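As a quick illustration of the string example (a trivial sketch, not from the original article): every "modifying" operation on a string returns a new string, so any number of threads can read the same string concurrently without locks.

string s = "hello";
string t = s.Replace('h', 'H');
// s is still "hello" and t is "Hello": s was never modified,
// so concurrent readers of s can never observe a partial update.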
For example, suppose we want to select all the customers living in the UK. The sequential version has been demonstrated before:
var UKCustomer =
    from customer in customers
    where customer.LivingIn == "UK"
    select customer;
To make it work in parallel, the only thing we need to do is change customers into customers.AsParallel():
var UKCustomer =
    from customer in customers.AsParallel()
    where customer.LivingIn == "UK"
    select customer;
The .NET Parallel Extensions will do all the implementation and optimization work for you. Thanks to the immutability of variables in LINQ queries, neither developers nor the compiler need to worry about complex locks, semaphores, and so on. As a result, there can be a simple pattern for parallelizing code, which makes an easy-to-use library possible. In C# this pattern is called PLINQ. In F#, where nearly all variables are immutable, parallelization is even simpler: often we only need to replace the "let" keyword with "let!" and everything is done.
.NET Parallel Extension
Besides using the functional style to make parallel programming easy, .NET PE provides other very useful data structures. Let's take a brief look.
Future
Future is similar to the Task class mentioned before: it queues another piece of work to the thread pool and starts it at a proper time. However, unlike a Task instance, a Future instance can return an object as its result, exposed through the Value property. When some thread reads the Value property of a Future object, it gets the result immediately if the Future has finished its computation, or it blocks until the Future's work is done.
It provides an alternative to events and semaphores, so the code can be more readable and less error-prone. For example, we can count a tree's nodes like this:
int CountNodes(Tree<int> node)
{
    if (node == null) return 0;
    // Count the left subtree on the thread pool while this thread handles the right one
    var left = Future.Create(() => CountNodes(node.Left));
    int right = CountNodes(node.Right);
    // Reading left.Value blocks until the future has finished its computation
    return 1 + left.Value + right;
}
WriteOnce
Many parallel algorithms, especially those that allow concurrent writing, rely on variables that can only be assigned once; a number of sorting algorithms are examples. The WriteOnce class provides exactly such a data structure.
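To show the idea, here is a simplified sketch of the concept, not the actual WriteOnce API; the type WriteOnceCell is made up for illustration. A write-once cell can be built on an atomic compare-and-swap:

using System;
using System.Threading;

public class WriteOnceCell<T> where T : class
{
    private T value;

    // Only the first write succeeds; all later attempts return false.
    public bool TrySetValue(T newValue)
    {
        return Interlocked.CompareExchange(ref value, newValue, null) == null;
    }

    // Reading before any successful write is an error.
    public T Value
    {
        get
        {
            var v = value;
            if (v == null)
                throw new InvalidOperationException("Value has not been set yet.");
            return v;
        }
    }
}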
BlockingCollection
In the classical producer/consumer problem, we want a pipeline into which all the producers can put their products and from which all the consumers can fetch work, with consumers blocking when there is nothing in the pipeline. This is exactly what BlockingCollection does in .NET PE: it provides a thread-safe implementation for concurrent reading and writing, and automatically blocks consumers when it is empty.
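Here is a minimal producer/consumer sketch. It uses BlockingCollection<T> as it later shipped in .NET 4 under System.Collections.Concurrent (the CTP placed it in a different namespace), together with Task.Factory.StartNew; the class name ProducerConsumerDemo and the item count are arbitrary.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerDemo
{
    static void Main()
    {
        var pipeline = new BlockingCollection<int>();

        // Producer: put ten work items into the pipeline, then signal completion.
        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 10; i++)
                pipeline.Add(i);
            pipeline.CompleteAdding();
        });

        // Consumer: GetConsumingEnumerable blocks while the pipeline is empty
        // and finishes once the producer has called CompleteAdding.
        var consumer = Task.Factory.StartNew(() =>
        {
            foreach (int item in pipeline.GetConsumingEnumerable())
                Console.WriteLine("Consumed {0}", item);
        });

        Task.WaitAll(producer, consumer);
    }
}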
These three useful components show that .NET PE grew out of practical programming, addresses developers' real needs, and benefits greatly from the functional programming style.
Summary
Every statement in a functional language can be treated as a variable, so we can describe our intention to the compiler with a more powerful vocabulary. In addition, because variables in functional languages are immutable, the compiler can offer extremely easy ways to parallelize code. From these two factors we can see that the functional programming style really does help us write parallel code more easily: more functional, more parallel.
P.S. Note that parallel programming is quite complex, and this series of articles covers only a small part of it. Writing practical parallel code with scalability, robustness and good performance requires knowledge of parallel algorithms, hardware architecture, the programming language and its extension libraries. Still, the functional style does simplify the parallel coding procedure and provides new solutions to some sequential problems, even though it has less effect on the algorithms themselves.