- Executing Thread Blocks
- Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
- Up to 8 blocks per SM, as resources allow
- A Fermi SM can take up to 1536 threads
- The SM maintains thread/block indices (threadIdx, blockIdx)
- Warps as Scheduling Units
- Each block is executed as 32-thread warps (each warp executes as a SIMD unit)
- The SM implements zero-overhead warp scheduling: warps whose next instruction has its operands ready are eligible for execution
- Control Divergence: threads in the same warp that take different control-flow paths are serialized, reducing performance
- If there are any dependencies between threads, you must call __syncthreads() to get correct results
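A sketch of the difference, assuming a warp size of 32 (both kernel names are illustrative): in the first kernel, odd and even lanes of the same warp take different paths, so the hardware runs the two paths one after the other; in the second, the branch condition is uniform across each warp, so no warp executes both paths.

```cuda
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)         // lanes within one warp disagree
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

__global__ void warpUniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)  // all 32 lanes of a warp agree
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}
```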
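A minimal barrier sketch (kernel name and size are illustrative; it assumes one block of exactly N threads): each thread writes one shared-memory element and then reads its neighbor's. Without the __syncthreads() between the write and the read, a thread could read an element before the owning thread has written it.

```cuda
#define N 256

__global__ void shiftLeft(const float *in, float *out) {
    __shared__ float buf[N];
    int t = threadIdx.x;
    buf[t] = in[t];             // write phase
    __syncthreads();            // all writes complete before any read
    out[t] = buf[(t + 1) % N];  // read a neighbor's element
}
```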
- Memory Model
- Global Memory (dynamic or static allocation)
- Register (automatic variables)
- Shared Memory (__shared__)
- Constant Memory (__constant__)
- __device__ is optional when used with __shared__ or __constant__
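A sketch showing where each memory space is declared (the kernel and variable names are illustrative): automatic scalars live in registers, __shared__ arrays are per-block, and __constant__ data is read-only in kernels and is typically set from the host with cudaMemcpyToSymbol.

```cuda
__constant__ float coeff[4];          // constant memory, read-only in kernels

__global__ void kernel(const float *gIn, float *gOut) {  // pointers to global memory
    __shared__ float tile[128];       // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = gIn[i];                 // 'v' and 'i' are automatic variables in registers
    tile[threadIdx.x] = v * coeff[0];
    __syncthreads();
    gOut[i] = tile[threadIdx.x];
}
```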
- Partition data into subsets or tiles that fit into shared memory
- Use one thread block to handle each tile by:
- Loading the tile from global memory to shared memory, using multiple threads
- Performing the computation on the subset from shared memory, reducing traffic to global memory
- Upon completion, writing results from shared memory to global memory
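The three steps above can be sketched as a tiled square-matrix multiply (TILE and the kernel name are illustrative; this simplified version assumes width is a multiple of TILE):

```cuda
#define TILE 16

__global__ void tiledMatMul(const float *A, const float *B, float *C, int width) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < width / TILE; ++t) {
        // 1. cooperatively load one tile of A and one tile of B into shared memory
        As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();               // tiles fully loaded before use
        // 2. compute on the shared-memory tiles
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with tiles before overwriting them
    }
    // 3. write the result back to global memory
    C[row * width + col] = acc;
}
```

Each element of A and B is read from global memory only once per tile instead of once per thread, cutting global-memory traffic by a factor of TILE.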