【2014-11-24】Heterogeneous Parallel Programming – Section 2

  1. Executing Thread Blocks
    • Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
      • Up to 8 blocks per SM, as resources allow
      • A Fermi SM can hold up to 1536 threads, for example:
        • 6 blocks of 256 threads (256*6)
        • 3 blocks of 512 threads (512*3)
    • The SM maintains the thread and block indices
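
The limits above can be turned into a small arithmetic sketch. The helper names (`residentBlocksPerSM`, `residentThreadsPerSM`) are hypothetical, and the calculation only considers the two limits quoted in these notes (8 blocks and 1536 threads per Fermi SM); registers and shared memory, which can lower the number further, are ignored.

```cuda
#include <algorithm>
#include <cassert>

// Limits for a Fermi SM as quoted in the notes (assumed, not queried from the device).
constexpr int kMaxThreadsPerSM = 1536;
constexpr int kMaxBlocksPerSM  = 8;

// Hypothetical helper: number of blocks of the given size that can be resident
// on one SM when only the thread limit and the block limit are considered.
inline int residentBlocksPerSM(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    int limitedByThreads = kMaxThreadsPerSM / threadsPerBlock;
    return std::min(limitedByThreads, kMaxBlocksPerSM);
}

// Hypothetical helper: total threads resident on the SM for that block size.
inline int residentThreadsPerSM(int threadsPerBlock) {
    return residentBlocksPerSM(threadsPerBlock) * threadsPerBlock;
}
```

With 256 or 512 threads per block the SM is full (6 or 3 blocks, 1536 threads). With 64 threads per block the 8-block limit is hit first, so only 512 threads are resident.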
  2. Warps as Scheduling Units
    • Each block is executed as 32-thread warps (each warp runs as a SIMD unit)
    • The SM implements zero-overhead warp scheduling
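
A minimal sketch of how the threads of a 1D block map onto warps, assuming the 32-thread warp size stated above. The kernel and helper names (`recordWarpAndLane`, `warpsPerBlock`, `checkWarpLayout`) are made up for illustration; only `cudaMalloc`, `cudaMemcpy` and `cudaFree` come from the CUDA runtime.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // warp width as stated in the notes

// Each thread records the warp it belongs to inside its block and its lane in that warp.
__global__ void recordWarpAndLane(int *warpId, int *laneId, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        warpId[i] = threadIdx.x / kWarpSize;
        laneId[i] = threadIdx.x % kWarpSize;
    }
}

// Number of warps in a 1D block; the last warp may be only partially filled.
inline int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Hypothetical host helper: launches one block and checks the recorded layout.
bool checkWarpLayout(int threadsPerBlock) {
    assert(threadsPerBlock > 0 && threadsPerBlock <= 1024);
    int n = threadsPerBlock;
    size_t bytes = n * sizeof(int);
    int *dWarp = nullptr, *dLane = nullptr;
    if (cudaMalloc(&dWarp, bytes) != cudaSuccess ||
        cudaMalloc(&dLane, bytes) != cudaSuccess) {
        cudaFree(dWarp);
        cudaFree(dLane);
        return false;
    }
    recordWarpAndLane<<<1, threadsPerBlock>>>(dWarp, dLane, n);
    std::vector<int> hWarp(n), hLane(n);
    cudaError_t e1 = cudaMemcpy(hWarp.data(), dWarp, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hLane.data(), dLane, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dWarp);
    cudaFree(dLane);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) {
        if (hWarp[i] != i / kWarpSize || hLane[i] != i % kWarpSize) return false;
    }
    return true;
}
```

For example, a block of 256 threads is split into 8 warps, and a block of 100 threads needs 4 warps with the last one only partly used.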
  3. Control Divergence
    • If threads in a block depend on each other's results, you must call __syncthreads() to get correct results.
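
A small sketch of such a dependency: reversing an array inside one block, where every thread reads an element that a different thread wrote. The names `reverseInBlock`, `reverseOnDevice` and the block size of 64 are assumptions chosen for the example.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kBlockSize = 64;  // assumed block size for this sketch

// Reverses kBlockSize elements in place using a single thread block.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[kBlockSize];
    int t = threadIdx.x;
    tile[t] = data[t];
    // Barrier: every thread of the block must have written its element
    // before any thread reads an element written by another thread.
    // All threads of the block reach this call (it is not inside a branch).
    __syncthreads();
    data[t] = tile[kBlockSize - 1 - t];
}

// Hypothetical host helper: returns the reversed copy, or an empty vector on error.
std::vector<int> reverseOnDevice(const std::vector<int> &input) {
    assert(static_cast<int>(input.size()) == kBlockSize);
    size_t bytes = kBlockSize * sizeof(int);
    int *dData = nullptr;
    if (cudaMalloc(&dData, bytes) != cudaSuccess) return {};
    cudaMemcpy(dData, input.data(), bytes, cudaMemcpyHostToDevice);
    reverseInBlock<<<1, kBlockSize>>>(dData);
    std::vector<int> output(kBlockSize);
    cudaError_t err = cudaMemcpy(output.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    if (err != cudaSuccess) return {};
    return output;
}
```

Without the barrier, a thread in one warp could read `tile[kBlockSize - 1 - t]` before the thread in another warp has written it.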
  4. Memory Model
    • Global Memory (dynamically or statically allocated)
    • Register (automatic variables)
    • Shared Memory (__shared__)
    • Constant Memory (__constant__)
    • __device__ is optional when used with __shared__ or __constant__
    • Partition data into subsets or tiles that fit into shared memory
    • Use one thread block to handle each tile by:
      • Loading the tile from global memory into shared memory, using multiple threads
      • Performing the computation on the subset from shared memory, reducing traffic to global memory
      • Upon completion, writing the results from shared memory to global memory
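
The tiling steps above can be sketched with a tiled matrix multiplication. This is only an illustration under assumptions: square row-major matrices, a tile width of 16, and made-up names (`tiledMatMul`, `matMulOnDevice`). One variation from the steps listed: each thread accumulates its result in a register and writes that value to global memory at the end.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kTile = 16;  // assumed tile width; each block is kTile x kTile threads

// C = A * B for square n x n row-major matrices.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float tileA[kTile][kTile];
    __shared__ float tileB[kTile][kTile];
    int row = blockIdx.y * kTile + threadIdx.y;
    int col = blockIdx.x * kTile + threadIdx.x;
    float acc = 0.0f;
    int numTiles = (n + kTile - 1) / kTile;
    for (int t = 0; t < numTiles; ++t) {
        // Step 1: all threads of the block cooperatively load one tile of A and B
        // (out-of-range elements are padded with zero).
        int aCol = t * kTile + threadIdx.x;
        int bRow = t * kTile + threadIdx.y;
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // the tile must be fully loaded before it is used
        // Step 2: compute from shared memory instead of global memory.
        for (int k = 0; k < kTile; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();  // finish using the tile before it is overwritten
    }
    // Step 3: write the result back to global memory.
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: returns C, or an empty vector on error.
std::vector<float> matMulOnDevice(const std::vector<float> &A,
                                  const std::vector<float> &B, int n) {
    size_t count = static_cast<size_t>(n) * n;
    assert(n > 0 && A.size() == count && B.size() == count);
    size_t bytes = count * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return {};
    }
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(kTile, kTile);
    dim3 grid((n + kTile - 1) / kTile, (n + kTile - 1) / kTile);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> C(count);
    cudaError_t err = cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return {};
    return C;
}
```

Each element of A and B is read from global memory once per tile rather than once per multiply, which is where the reduction in global memory traffic comes from.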

posted on 2014-11-24 11:23  sjtujoe