CUDA学习笔记（一）——CUDA编程模型

转自：http://blog.sina.com.cn/s/blog_48b9e1f90100fm56.html

CUDA的代码分成两部分，一部分在host(CPU)上运行，是普通的C代码；另一部分在device(GPU)上运行，是并行代码，称为kernel，由nvcc进行编译。

Kernel产生的所有线程成为Grid。在并行部分结束后，程序回到串行部分即到host上运行。

在CUDA中，host和device有不同的内存空间。所以在device上执行kernel时，程序员需要把host memory上的数据传送到分配的device memory上。在device执行完以后，需要把结果从device传送回host，并释放device memory。CUDA runtime system提供了API给程序员做这些事情。

Float *Md;

Int size=Width*Width*sizeof(float);

API：

cudaMalloc((void**)&Md, size)——从host code调用，为device在global memory分配内存空间。第一个参数是指向分配对象的地址，第二个参数是分配大小；

cudaFree(Md)——释放device Global Memory。

cudaMemcpy(Md, M, size, dudaMemcpyHostToDevice)——内存数据传输。四个参数分别为：指向目的数据的指针，指向源（要copy的）数据指针，要copy出的数据字节数，传输方式（host to host, host to device, device to host, device to device）

内核部分

__global__说明这个函数是一个kernel，host function可以调用这个函数产生线程

threadIdx.x线程index

一个kernel被调用时，以并行线程的grid形式执行。一个kernel创建一个grid。Grid中的线程被组织成两个层次。在最顶层，每个grid包含一个或多个thread block。Grid中的所有block有相同数目的线程。每个thread block有一个唯一的二维坐标，由CUDA的特定关键字blockIdx.x和blockIdx.y指定。所有的thread block必须以相同的方式组织，并有相同数目的thread。

Thread block：包含相互之间能够协作的线程，这些线程通过同步或者在低延迟的shared memory之间共享数据进行协作。不同block里的线程不能协作。每个thread block组织成三位的线程数组，最大线程数目为512。Block中的线程坐标是唯一的，通过三个线程id指定：threadIdx.x, threadIdx.y, threadIdx.z。不是所有的应用程序会使用thread block的三个维度。

当host code调用一个kernel时，通过参数传递来设置grid和thread block的维度。如下：

// Setup the execution configuration

dim3 dimBlock(WIDTH, WIDTH);

dim3 dimGrid(1, 1);

// Launch the device computation threads!

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

以上是摘自David Kirk和Wen-mei Hwu的课程，讲的比较清楚。感觉CUDA编程一个比较自由的编程方式，由于是在C之上的扩展，加了一些关键字，比较容易，编程方式让人很好接受。一方面给了程序员很大的发挥空间，thread, thread block等都可以自由配置，另一方面也给程序员提出了挑战，这么大的空间中怎样编程以取得好的性能。

一个简单的矩阵乘程序

#include<stdio.h>

#include<stdlib.h>

#include<cuda.h>

//内核程序

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)

{

//2D Thread ID

int tx=threadIdx.x;

int ty=threadIdx.y;

printf("I'm thread: %d %d\n",tx,ty);

//Pvalue stores the Pd element that is computed by the thread

float Pvalue=0;

for(int k=0; k<Width; k++)

{

float Mdelement=Md[ty*Width+k];

float Ndelement=Nd[k*Width+tx];

Pvalue+=Mdelement*Ndelement;

printf("%f %f %f\n",Mdelement,Ndelement,Pvalue);

}

//Write the matrix to device memory each thread writes one element

Pd[ty*Width+tx]=Pvalue;

}

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)

{

int size=Width*Width*sizeof(float);

float *Md,*Nd,*Pd;

dim3 dimBlock(Width,Width);

dim3 dimGrid(1,1);

//Load M and N to device memory

cudaMalloc((void **)&Md,size);

cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice);

cudaMalloc((void **)&Nd,size);

cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice);

//for(int i=0;i<3;i++)printf("%d ",Md[i]);

//Allocate P on the device

cudaMalloc((void**)&Pd,size);

//Kernel invocation code

MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

//Read P from the device

cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost);

//Free device matrices

cudaFree(Md);cudaFree(Nd);cudaFree(Pd);

}

int main(void)

{

// Allocate and initialize the matrices M,N,P

// I/O to read the input matrices M and N

//int size=Width*Width*sizeof(float);

float *M,*N,*P;

int Width=4;

//int size=Width*Width*sizeof(float);

int i=0;

M=(float *)malloc(sizeof(float)*Width*Width);

N=(float *)malloc(sizeof(float)*Width*Width);

P=(float *)malloc(sizeof(float)*Width*Width);

for(i=0;i<Width*Width;i++)

M[i]=(float)i;

for(i=0;i<Width*Width;i++)

N[i]=(float)i;

// for(i=0;i<Width*Width;i++)

// printf("%3f ",N[i]);

// M*N on the device

MatrixMulOnDevice(M,N,P,Width);

for(i=0;i<Width*Width;i++)

{

if(i%Width==0)printf("\n");

printf("%3f ",P[i]);

}

printf("\n");

// I/O to write the output matrix P

// Free matrices M, N, P

free(M);free(N);free(P);

return 0;

}

运行命令：

nvcc -deviceemu matrixmul.cu -o matrixmul

注意： -deviceemu在此处是必须的，因为在device中调用了printf，这属于device调用了host function

下一步：理解计算是怎样并行的？

普通CPU程序和GPU程序的性能比怎样？用时间衡量

posted @ 2013-10-22 17:09 qingsun_ny 阅读(889) 评论(0) 收藏举报

刷新页面返回顶部

qingsun_ny

CUDA学习笔记（一）——CUDA编程模型

公告