A Study of Simple Parallel Computing Techniques

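This article is written mainly for people like us who are not computer science majors but still have to write programs and implement algorithms. Some of us have never even used multithreading, so it only covers ways to get parallel computation without heavily modifying an existing program.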

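Parallel computing is divided into two categories here: multithreaded processing on the CPU, and parallel computing on heterogeneous architectures (such as GPUs). The CPU-based options include OpenMP, TBB, PPL, Parallel, and IPP; the heterogeneous options include OpenCL, CUDA, and AMP. I have not used all of them, so only some are introduced here; I will add more once I have used them.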

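Terminology
Thread lock: a lock that only one thread can hold at a time. If a thread fails to acquire the lock, it is blocked, and the CPU it was running on is rescheduled to run another thread or process.
Parallel computing: the process of using multiple computing resources simultaneously to solve a computational problem. It is an effective way to increase a computer system's computing speed and processing capacity.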

OpenMP

Requirements: language C/C++ or Fortran; compiler Sun Studio, Intel Compiler, Microsoft Visual Studio, or GCC (among others). Only the optimization of for loops is covered here.

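Key points:

Enable the compiler's OpenMP switch: in Visual Studio, for example, open the project's Properties, go to Configuration Properties->C/C++->Language->OpenMP Support, and select Yes from the drop-down menu. With GCC, pass the -fopenmp flag.

Include the header: #include <omp.h>

Add parallelism: put #pragma omp parallel for immediately before the for loop.

Thread lock: #pragma omp critical {…}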
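To make the thread lock concrete, here is a minimal sketch of #pragma omp critical guarding a shared counter. The function name countEven is made up for this illustration. If OpenMP is not enabled, the pragmas are ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>

// Hypothetical helper (not from the original article): counts the even
// values in data[0..n-1] using a shared counter.
int countEven(const int* data, int n)
{
    assert(n >= 0);
    int count = 0; // shared by all threads
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        if (data[i] % 2 == 0)
        {
            // Only one thread at a time may execute this block, so the
            // increment of the shared counter cannot be interleaved.
#pragma omp critical
            {
                count++;
            }
        }
    }
    return count;
}
```

Calling countEven on the array {1,2,...,10} should return 5. Note that the opening brace of the critical block has to go on the line after the pragma. A critical section serializes the threads, so it is best kept as small as possible.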
Full example:

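#include <iostream>
#include <vector>
#include <omp.h>
int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    // Upper bound on the number of threads a parallel region may use.
    // The processor count (omp_get_num_procs) is not a safe array size,
    // because the thread count can be set higher, e.g. via OMP_NUM_THREADS.
    int maxThreads = omp_get_max_threads();
    std::vector<int> sumArray(maxThreads, 0); // one partial sum per thread, initialized to 0
#pragma omp parallel for
    for (int i=0;i<10;i++)
    {
        int k = omp_get_thread_num(); // ID of the thread running this iteration
        sumArray[k] = sumArray[k]+a[i];
    }
    for (int i = 0;i<maxThreads;i++) // combine the per-thread partial sums
        sum = sum + sumArray[i];
    std::cout<<"sum: "<<sum<<std::endl;
    return 0;
}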
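The per-thread array above does by hand what OpenMP's reduction clause does automatically. As a rough sketch (the function name sumWithReduction is made up for this illustration), the same sum could be written as:

```cpp
#include <cassert>

// Hypothetical helper (not from the original article): sums data[0..n-1].
// With OpenMP enabled, each thread accumulates into its own private copy
// of `sum`, and the copies are added together when the loop finishes.
// Without OpenMP the pragma is ignored and the loop runs serially.
int sumWithReduction(const int* data, int n)
{
    assert(n >= 0);
    int sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += data[i];
    return sum;
}
```

For the array {1,2,...,10} this returns 55, the same as the full example. No lock and no helper array are needed, because the threads never write to the same variable inside the loop.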

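Notes:
When a for loop is parallelized, the iterations are essentially split into segments and each core processes one segment. For example, with for (int i=0;i<40;i++) on a CPU with 4 cores, CPU0 handles i=0~9, CPU1 handles i=10~19, and so on. So if an iteration depends on the result of an earlier one, do not parallelize the loop.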
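The split described in the note can be sketched in plain C++. This assumes equal contiguous chunks; the schedule an OpenMP runtime actually picks by default is implementation-defined, so treat it only as an illustration. Range, chunkFor and prefixSum are names made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>

struct Range { int begin; int end; }; // half-open interval [begin, end)

// Hypothetical helper: the block of iterations thread `tid` would get if
// n iterations were divided into equal contiguous chunks over numThreads.
Range chunkFor(int n, int numThreads, int tid)
{
    assert(n >= 0 && numThreads > 0 && tid >= 0 && tid < numThreads);
    int chunk = (n + numThreads - 1) / numThreads; // chunk size, rounded up
    int begin = std::min(n, tid * chunk);
    int end = std::min(n, begin + chunk);
    return {begin, end};
}

// A loop with a dependency between iterations: out[i] needs out[i-1],
// which may belong to another thread's chunk. Keep a loop like this serial.
void prefixSum(const int* in, int* out, int n)
{
    if (n <= 0) return;
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}
```

With n=40 and 4 threads, chunkFor gives thread 0 the range [0,10), thread 1 the range [10,20), and so on. In prefixSum, the thread that owns [10,20) would need out[9] before thread 0 has necessarily computed it, which is why such a loop must not get #pragma omp parallel for.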
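Further reading:
Some experience with OpenMP (openMP的一点使用经验), yangyangcv, cnblogs
http://www.cnblogs.com/yangyangcv/archive/2012/03/23/2413335.html
Performance comparison of locks and atomic operations in OpenMP threads (OpenMP创建线程中的锁及原子操作性能比较)
http://blog.csdn.net/drzhouweiming/article/details/1689853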

Parallel

Requirements: .NET Framework 4 or later

Key points:
Import the namespace: using System.Threading.Tasks;
Use the following methods in place of for and foreach:
Parallel.For(int fromInclusive,int toExclusive,Action<int, ParallelLoopState> body)
Parallel.ForEach<TSource>(IEnumerable<TSource> source,Action<TSource> body)
Full example:

using System;
using System.Threading.Tasks;

public class Example
{
   public static void Main()
   {
      ParallelLoopResult result = Parallel.For(0, 100, ctr => 
      { 
            Random rnd = new Random(ctr * 100000);
            Byte[] bytes = new Byte[100];
            rnd.NextBytes(bytes);
            int sum = 0;
            foreach(var byt in bytes)
                sum += byt;
            Console.WriteLine("Iteration {0,2}: {1:N0}", ctr, sum);
      });
      Console.WriteLine("Result: {0}", result.IsCompleted ? "Completed Normally" : String.Format("Completed to {0}", result.LowestBreakIteration));
   }
}

Further reading:
MSDN
https://msdn.microsoft.com/zh-cn/library/system.threading.tasks.parallel_methods(v=vs.100).aspx

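PPL

PPL (the Parallel Patterns Library) is the C++ counterpart of C#'s Parallel. For details, see the article 《遇见PPL：C++ 的并行和异步》 ("Meet PPL: Parallelism and Asynchrony in C++").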
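As a rough idea of what a PPL loop looks like, here is a minimal sketch using concurrency::parallel_for from <ppl.h>. PPL ships with Visual C++, so the sketch assumes that compiler and falls back to an ordinary loop elsewhere; the function name squareAll is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>
#ifdef _MSC_VER
#include <ppl.h> // PPL is only available with Visual C++
#endif

// Hypothetical helper (not from the original article): squares every
// element of v in place. Each iteration touches only v[i], so the
// iterations are independent and safe to run in parallel.
void squareAll(std::vector<int>& v)
{
    assert(v.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    const int n = static_cast<int>(v.size());
#ifdef _MSC_VER
    // Runs the lambda for i = 0 .. n-1, distributing iterations over threads.
    concurrency::parallel_for(0, n, [&](int i) { v[i] = v[i] * v[i]; });
#else
    // Serial fallback for compilers without PPL.
    for (int i = 0; i < n; i++)
        v[i] = v[i] * v[i];
#endif
}
```

The shape is the same as Parallel.For in C#: a start index, an exclusive end index, and a lambda that receives the loop index.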

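C++ AMP

Why do parallel computing on the GPU? Today's multi-core CPUs are typically dual-core or quad-core; counting Hyper-Threading, they can be treated as four or eight logical cores. GPUs, by contrast, easily have hundreds of cores: the mid-range NVIDIA GTX 560 SE has 288, and the top-end NVIDIA GTX 690 has as many as 3072. These many-core GPUs are well suited to large-scale parallel computing.
However, each GPU core is less powerful than a CPU core, so GPUs are best suited to simple processing of massive amounts of data.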

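Requirements: language C++ (C++11); compiler Visual Studio 2012 or later; runtime DirectX 11 or later (Windows 7 or later with an up-to-date graphics driver is sufficient; Windows XP is not supported).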

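Key points:
Include the headers: #include <amp.h> and #include <amp_math.h>
Import a math namespace: using namespace concurrency::fast_math supports only single-precision floating point, while using namespace concurrency::precise_math supports both single and double precision.
Wrap the arrays in array_view objects.
Use a parallel_for_each loop.
Full example: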

#include <amp.h>
#include <iostream>
using namespace concurrency;

const int size = 5;

void CppAmpMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[size];

    // Create C++ AMP objects.
    array_view<const int, 1> a(size, aCPP);
    array_view<const int, 1> b(size, bCPP);
    array_view<int, 1> sum(size, sumCPP);
    sum.discard_data();

    parallel_for_each( 
        // Define the compute domain, which is the set of threads that are created.
        sum.extent, 
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
    {
        sum[idx] = a[idx] + b[idx];
    }
    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}

int main() {
    CppAmpMethod();
    return 0;
}

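Notes
Functions that have the restrict(amp) clause are subject to the following restrictions:

The function can call only functions that have the restrict(amp) clause.
The function must be inlinable.
The function can declare only int, unsigned int, float, and double variables, and classes and structures that contain only these types. bool is also allowed, but it must be 4-byte-aligned if you use it in a compound type.
Lambda functions cannot capture by reference and cannot capture pointers.
References and single-indirection pointers are supported only as local variables, function arguments, and return types.
The following are not allowed:
- Recursion.
- Variables declared with the volatile keyword.
- Virtual functions.
- Pointers to functions.
- Pointers to member functions.
- Pointers in structures.
- Pointers to pointers.
- goto statements.
- Labeled statements.
- try, catch, or throw statements.
- Global variables.
- Static variables. Use the tile_static keyword instead.
- dynamic_cast casts.
- The typeid operator.
- asm declarations.
- Varargs.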

Further reading
Introduction: http://www.infoq.com/cn/articles/cpp_amp_computing_on_GPU
MSDN: https://msdn.microsoft.com/zh-cn/library/hh265136.aspx

Posted 2016-02-29 10:59 by 夏至千秋
