A First Look at General-Purpose GPU Computing in C#

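GPUs have far more parallel computing power than CPUs, so a growing number of GPU-based projects have been showing up lately. On InfoQ I came across an article introducing Accelerator-V2, a Microsoft Research project that you have to register to download. It looked like a good first step into general-purpose GPU computing for me, so I downloaded it.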

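The installer includes several sample programs, such as the famous Game of Life. For someone just getting started with GPU computing, though, Life is still a bit too complex, so I simplified things and ran only some simple calculations. I found that DX9Target.ToArray throws an "unsupported operation" exception when the output argument is an int array. On reflection that makes sense: graphics cards really are best at floating-point arithmetic.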

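I had assumed that GPU computing was only available from DirectX 11 onward, but Accelerator targets DirectX 9. Presumably DirectX 11 offers more computing capability and a simpler way to use it.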

To get a rough comparison of CPU and GPU speed, I also wrote a parallel version for .NET 4. Because DX9Target does not support int, the array here uses float as well:

// Requires: using System.Diagnostics; using System.Drawing; using System.Linq;
private const int GridSize = 1024;
private float[] _map;

public Form1()
{
    InitializeComponent();
    _map = new float[GridSize * GridSize];
    for (int y = 0; y < GridSize; y++)
    {
        for (int x = 0; x < GridSize; x++)
        {
            _map[x * GridSize + y] = x * y;
        }
    }
    Render();
}

private void Start_Click(object sender, EventArgs e)
{
    var stopwatch = Stopwatch.StartNew();
    // AsOrdered() keeps the results in source order; PLINQ does not guarantee that otherwise.
    _map = _map.AsParallel().AsOrdered().Select(p => p * p * p / 4 + 194).ToArray();
    var time = stopwatch.ElapsedMilliseconds;
    this.Text = time.ToString();
    Render();
}

private void Render()
{
    var workingBitmap = new Bitmap(pictureBox1.Width, pictureBox1.Height);

    for (int y = 0; y < pictureBox1.Height; y++)
    {
        for (int x = 0; x < pictureBox1.Width; x++)
        {
            // Samples every second cell, so the picture box must be at most GridSize / 2 pixels per side.
            workingBitmap.SetPixel(x, y, Color.FromArgb(-0x1000000 | (int)_map[x * 2 * GridSize + y * 2]));
        }
    }
    pictureBox1.Image = workingBitmap;
}
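
PLINQ does not promise to return results in source order unless the query asks for it with AsOrdered(), which matters here because each array index stands for a grid cell. The following is a minimal sketch, not the program from this post: the names OrderedPlinqSketch, Transform, RunSequential and RunParallel and the smaller grid are assumptions made for illustration. It runs the same formula in a plain loop and through an ordered PLINQ query, then checks that both produce the same array.

```csharp
using System;
using System.Linq;

static class OrderedPlinqSketch
{
    // Same per-element formula as in the post; the cast forces float precision.
    public static float Transform(float p)
    {
        return (float)(p * p * p / 4 + 194);
    }

    // Plain sequential loop, used as the reference result.
    public static float[] RunSequential(float[] input)
    {
        var output = new float[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = Transform(input[i]);
        }
        return output;
    }

    // PLINQ version; AsOrdered() asks PLINQ to preserve the source order.
    public static float[] RunParallel(float[] input)
    {
        return input.AsParallel().AsOrdered().Select(p => Transform(p)).ToArray();
    }

    static void Main()
    {
        const int gridSize = 256; // hypothetical size, smaller than the post's 1024
        var map = new float[gridSize * gridSize];
        for (int i = 0; i < map.Length; i++)
        {
            map[i] = (i / gridSize) * (i % gridSize);
        }

        bool same = RunSequential(map).SequenceEqual(RunParallel(map));
        Console.WriteLine(same ? "results match" : "results differ");
    }
}
```

Comparing against a sequential reference like this is also a cheap way to sanity-check any parallel rewrite before timing it.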

The version that uses Accelerator looks like this:

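// Requires the Accelerator assembly (namespace Microsoft.ParallelArrays),
// plus using System.Diagnostics; using System.Drawing;
private const int GridSize = 1024;
private readonly DX9Target _target;
private float[,] _map;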

public Form1()
{
    InitializeComponent();
    _target = new DX9Target();
    _map = new float[GridSize, GridSize];
    for (int y = 0; y < GridSize; y++)
    {
        for (int x = 0; x < GridSize; x++)
        {
            _map[x, y] = x * y;
        }
    }
    Render();
}

private void Start_Click(object sender, EventArgs e)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();

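    // Wrap the grid as a data-parallel array and describe the element-wise computation.
    var p = new FloatParallelArray(_map);
    p = p * p * p / 4 + 194;
    // ToArray runs the computation on the DX9 target and copies the result back into _map.
    _target.ToArray(p, out _map);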

    var time = stopwatch.ElapsedMilliseconds;
    this.Text = time.ToString();
    Render();
}

private void Render()
{
    var workingBitmap = new Bitmap(pictureBox1.Width, pictureBox1.Height);

    for (int y = 0; y < pictureBox1.Height; y++)
    {
        for (int x = 0; x < pictureBox1.Width; x++)
        {
            workingBitmap.SetPixel(x, y, Color.FromArgb(-0x1000000 | (int)_map[x * 2, y * 2]));
        }
    }
    pictureBox1.Image = workingBitmap;
}

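I tested on my laptop (CPU: Core i5 430, graphics card: ATI 5650), clicking the Start button several times in each program. After about 3 runs the picture box turns completely black; at that point the plain parallel program slows down, while the GPU program's speed does not change noticeably. Four runs of the plain parallel program took 96, 89, 277 and 291 ms; four runs of the GPU program took 71, 40, 35 and 50 ms. For this test alone, on my machine, the GPU program is roughly twice as fast as the plain parallel program. The test itself is not necessarily fair, so treat the results as a rough reference only.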
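
One likely reason the picture box goes black is plain arithmetic: every click feeds the previous result back into p * p * p / 4 + 194, so the values grow very quickly. This is an interpretation rather than something measured in the test above, and OverflowSketch, Step and Iterate are hypothetical names. The sketch follows cell (0, 0), which starts at 0 and is therefore the slowest-growing cell in the grid.

```csharp
using System;

static class OverflowSketch
{
    // One "Start" click applied to a single cell, rounded to float as in the array.
    public static float Step(float p)
    {
        return (float)(p * p * p / 4 + 194);
    }

    // Applies Step the given number of times.
    public static float Iterate(float start, int clicks)
    {
        float value = start;
        for (int i = 0; i < clicks; i++)
        {
            value = Step(value);
        }
        return value;
    }

    static void Main()
    {
        // Cell (0, 0) starts at 0, the smallest value in the grid.
        for (int clicks = 0; clicks <= 4; clicks++)
        {
            float value = Iterate(0f, clicks);
            Console.WriteLine("{0} clicks: {1} (fits in int: {2})",
                clicks, value, value <= int.MaxValue);
        }
    }
}
```

After three clicks even this cell is far beyond int.MaxValue, and after four it is positive infinity. Casting such a float to int in an unchecked context gives an unspecified value in C#, so the colors passed to Color.FromArgb stop being meaningful. On typical x86 runtimes the cast commonly yields 0x80000000, which combined with the 0xFF000000 alpha mask would be opaque black.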

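That said, parallel programming in Accelerator feels heavily constrained. Code that is normally easy to write takes a lot of effort to recast in this data-parallel style, and some logic cannot be expressed at all. Compared with Accelerator, Brahma code is much easier to write and to read; its Game of Life sample is simple and clear. Unfortunately, I built Brahma v0.1 and v0.4, and on my machine the DirectX samples produced no visible result, while the OpenGL samples threw a "The generated GLSL was invalid" exception. It looks like I will have to wait until it matures before I can use it.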