CUDA Programming: SM (Streaming Multiprocessor)

  1. A GPU contains multiple Streaming Multiprocessors (SMs), and each SM in turn contains multiple cores. An SM supports concurrent execution of hundreds of threads; the exact limit depends on the architecture.
  2. A thread block is scheduled onto exactly one SM and stays on that SM until the block has finished running. One SM can run multiple thread blocks at the same time.
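
The SM-related limits mentioned above can be read from the runtime. The following is a minimal sketch using `cudaGetDeviceProperties`; the helper name `report_sm_limits` is made up for illustration, and the values printed depend entirely on the GPU in the machine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints SM-related limits for one device and returns
// the SM count, or -1 if the query fails (for example, no CUDA device).
int report_sm_limits(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return -1;
    }
    std::printf("device %d: %s\n", device, prop.name);
    std::printf("  SM count:                %d\n", prop.multiProcessorCount);
    std::printf("  max resident threads/SM: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
    std::printf("  warp size:               %d\n", prop.warpSize);
    return prop.multiProcessorCount;
}
```

Calling `report_sm_limits(0)` from `main` queries the first device. Because the per-block limit is smaller than the per-SM limit on typical hardware, several blocks can be resident on one SM at once.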

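  1. There are two ways to partition the data. With block partitioning, the data is divided evenly by the number of threads: 10 threads split the data into 10 chunks, and each thread processes one chunk. With cyclic partitioning, there are more chunks than threads: for example, 10 threads split the data into 20 chunks, so the first thread processes chunks 1 and 11, the second thread processes chunks 2 and 12, and so on, with each thread looping over several chunks.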
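
To make the two partitioning schemes concrete, here is a small sketch in which every thread writes its own ID into the elements it is responsible for. The kernel names and the host helper `owner_of` are hypothetical, and a single block is launched only to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Block partitioning: each thread owns one contiguous chunk of the data.
__global__ void mark_owner_block(int *owner, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    int chunk = (n + nthreads - 1) / nthreads;  // ceil(n / nthreads)
    int begin = tid * chunk;
    int end = min(begin + chunk, n);
    for (int i = begin; i < end; ++i) owner[i] = tid;
}

// Cyclic partitioning: thread t handles elements t, t + T, t + 2T, ...
// where T is the total number of threads.
__global__ void mark_owner_cyclic(int *owner, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads) owner[i] = tid;
}

// Hypothetical host helper: returns the ID of the thread that handled
// element `index`, or -1 if any CUDA call fails (e.g. no device available).
int owner_of(int n, int nthreads, int index, bool cyclic) {
    int *d_owner = nullptr;
    if (cudaMalloc(&d_owner, n * sizeof(int)) != cudaSuccess) return -1;
    if (cyclic) mark_owner_cyclic<<<1, nthreads>>>(d_owner, n);
    else        mark_owner_block<<<1, nthreads>>>(d_owner, n);
    bool ok = cudaGetLastError() == cudaSuccess;  // catches bad launch configs
    std::vector<int> owner(n, -1);
    // cudaMemcpy waits for the kernel to finish before copying.
    ok = ok && cudaMemcpy(owner.data(), d_owner, n * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_owner);
    return ok ? owner[index] : -1;
}
```

With 20 elements and 10 threads, thread 0 owns elements 0 and 1 under block partitioning, but elements 0 and 10 under cyclic partitioning (the 0-based counterpart of "chunks 1 and 11").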
  • Even though many-core and multicore are used to label GPU and CPU architectures, a GPU core is quite different from a CPU core.
  • A CPU core, relatively heavy-weight, is designed for very complex control logic, seeking to optimize the execution of sequential programs.
  • A GPU core, relatively light-weight, is optimized for data-parallel tasks with simpler control logic, focusing on the throughput of parallel programs.


A typical CUDA program structure consists of five main steps:
1. Allocate GPU memory.
2. Copy data from CPU memory to GPU memory.
3. Invoke the CUDA kernel to perform the program-specific computation.
4. Copy the results back from GPU memory to CPU memory.
5. Free the GPU memory.
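
The five steps can be sketched with a vector addition. This is an illustrative outline rather than a tuned implementation: `add_vectors` and `add_on_gpu` are names chosen here, the block size of 256 is an arbitrary assumption, and error handling is reduced to a single success flag.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: each thread adds one pair of elements.
__global__ void add_vectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partly idle
}

// Hypothetical host wrapper: fills `c` with a + b and returns true on success.
bool add_on_gpu(const std::vector<float> &a, const std::vector<float> &b,
                std::vector<float> &c) {
    if (a.size() != b.size()) return false;
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    c.assign(n, 0.0f);
    if (n == 0) return true;  // nothing to do; avoids a zero-block launch

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    // Step 1: allocate GPU memory.
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMalloc(&d_c, bytes) == cudaSuccess;
    // Step 2: copy the inputs from CPU memory to GPU memory.
    ok = ok && cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
            && cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    // Step 3: invoke the kernel.
    if (ok) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up to cover n
        add_vectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // Step 4: copy the result back from GPU memory to CPU memory.
    ok = ok && cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    // Step 5: free the GPU memory (cudaFree accepts a null pointer).
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

The device pointers carry a `d_` prefix only as a naming convention to keep host and device memory apart; dereferencing them on the host is invalid.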


When a kernel function is launched from the host side (the CPU), execution moves to the device (the GPU), where a large number of threads are generated and each thread executes the statements specified by the kernel function. CUDA exposes a thread hierarchy abstraction to enable you to organize your threads. This is a two-level thread hierarchy decomposed into blocks of threads and grids of blocks.
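
As a rough illustration of the two-level hierarchy, the sketch below launches a 2D grid of 2D blocks and has every thread record which block it belongs to and where it sits inside that block. The names `record_position` and `locate` are hypothetical; the built-in variables `blockIdx`, `blockDim`, `threadIdx`, and `gridDim` are part of CUDA.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores (flattened block index, flattened index within block)
// for the cell it maps to.
__global__ void record_position(int2 *where, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    if (x < width && y < height) {
        int block_id  = blockIdx.y * gridDim.x + blockIdx.x;
        int thread_id = threadIdx.y * blockDim.x + threadIdx.x;
        where[y * width + x] = make_int2(block_id, thread_id);
    }
}

// Hypothetical host helper: covers a width x height domain with blocks of
// shape `block` and reports the (block, thread) pair that handled cell (x, y).
bool locate(int width, int height, dim3 block, int x, int y,
            int &block_id, int &thread_id) {
    int cells = width * height;
    int2 *d_where = nullptr;
    if (cudaMalloc(&d_where, cells * sizeof(int2)) != cudaSuccess) return false;
    // Round up so the grid covers the whole domain.
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    record_position<<<grid, block>>>(d_where, width, height);
    bool ok = cudaGetLastError() == cudaSuccess;
    std::vector<int2> where(cells);
    ok = ok && cudaMemcpy(where.data(), d_where, cells * sizeof(int2),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_where);
    if (ok) {
        block_id  = where[y * width + x].x;
        thread_id = where[y * width + x].y;
    }
    return ok;
}
```

For an 8 x 6 domain with `dim3(4, 2)` blocks, the grid is 2 x 3 blocks, and cell (5, 3) is handled by block 3 (second column, second row of the grid) as thread 5 within that block.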

All threads spawned by a single kernel launch are collectively called a grid. All threads in a grid share the same global memory space. A grid is made up of many thread blocks. A thread block is a group of threads that can cooperate with each other using:
➤ Block-local synchronization
➤ Block-local shared memory
Threads from different blocks cannot cooperate in this way.
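
Both cooperation mechanisms show up in a block-level sum. In the sketch below, the threads of one block stage their values in `__shared__` memory and use `__syncthreads()` between reduction steps. Since blocks cannot synchronize with each other this way, each block writes a partial sum and the host adds the partial sums. The names `block_sum`, `sum_on_gpu`, and `kBlockSize` are assumptions made for this example, and the block size must be a power of two for the halving loop to be correct.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

// Each block reduces its slice of `in` to a single value in `partial`.
__global__ void block_sum(const int *in, int *partial, int n) {
    __shared__ int tile[kBlockSize];  // visible to this block's threads only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();  // all loads must land before any thread reads a neighbor
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // finish this step before starting the next one
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host wrapper: sums `data` and returns true on success.
bool sum_on_gpu(const std::vector<int> &data, long long &total) {
    total = 0;
    int n = static_cast<int>(data.size());
    if (n == 0) return true;
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    int *d_in = nullptr, *d_partial = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_partial, blocks * sizeof(int)) == cudaSuccess;
    ok = ok && cudaMemcpy(d_in, data.data(), n * sizeof(int),
                          cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, kBlockSize>>>(d_in, d_partial, n);
        ok = cudaGetLastError() == cudaSuccess;
    }
    std::vector<int> partial(blocks, 0);
    ok = ok && cudaMemcpy(partial.data(), d_partial, blocks * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_partial);
    // Cross-block combination happens on the host.
    if (ok) for (int p : partial) total += p;
    return ok;
}
```

Note that `__syncthreads()` sits outside the `if` inside the loop, so every thread in the block reaches it; calling it from only some threads of a block is an error.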




A kernel function is the code to be executed on the device side. In a kernel function, you define the computation for a single thread, and the data access for that thread. When the kernel is called, many different CUDA threads perform the same computation in parallel.


The following restrictions apply to all kernels:
➤ Access to device memory only
➤ Must have void return type
➤ No support for a variable number of arguments
➤ No support for static variables
➤ No support for function pointers
➤ Exhibits asynchronous behavior
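
Two of these restrictions shape how results get back to the host: a kernel returns `void`, so output goes through a pointer to device memory, and a launch is asynchronous, so the host has to wait before it relies on the result. The following is a minimal sketch under those assumptions; `square` and `square_on_gpu` are illustrative names.

```cuda
#include <cuda_runtime.h>

// void return type: the result is written through a device pointer.
__global__ void square(int x, int *out) {
    *out = x * x;
}

// Hypothetical host wrapper: computes x * x on the device.
bool square_on_gpu(int x, int &result) {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return false;
    // The launch returns control to the host right away.
    square<<<1, 1>>>(x, d_out);
    bool ok = cudaGetLastError() == cudaSuccess;         // launch-time errors
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;   // wait for the kernel
    ok = ok && cudaMemcpy(&result, d_out, sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

The explicit `cudaDeviceSynchronize()` is redundant here, because a blocking `cudaMemcpy` also waits for prior work on the default stream; it is kept to make the waiting point visible.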



