shfl_*

shfl_xor

  1. Search the CUDA docs for shfl_xor
  2. https://tschmidt23.github.io/cse599i/CSE 599 I Accelerated Computing - Programming GPUs Lecture 18.pdf
  3. https://people.maths.ox.ac.uk/gilesm/cuda/2019/lecture_04.pdf
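
As a rough illustration of what the references above cover, here is a minimal sketch of a butterfly (XOR) warp reduction with `__shfl_xor_sync`. The kernel name `warpSumXor` and the host helper `warpSumXorOk` are made up for this note, and the sketch assumes a single full warp of 32 threads with every lane active.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each lane contributes its own lane ID. After 5 XOR-exchange steps
// (laneMask = 16, 8, 4, 2, 1) every lane holds the sum over the whole warp.
__global__ void warpSumXor(int *out) {
    int laneId = threadIdx.x & 0x1f;
    int value = laneId;
    for (int laneMask = 16; laneMask > 0; laneMask >>= 1) {
        // Read "value" from the lane whose ID is (laneId ^ laneMask) and add it.
        value += __shfl_xor_sync(0xffffffff, value, laneMask);
    }
    out[threadIdx.x] = value;
}

// Hypothetical host helper: launches one warp and checks that every lane
// ended up with 0 + 1 + ... + 31 = 496.
bool warpSumXorOk() {
    int *dev = nullptr;
    int host[32];
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    warpSumXor<<<1, 32>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < 32; ++i) {
        if (host[i] != 496) return false;
    }
    return true;
}

int main() {
    bool ok = warpSumXorOk();
    printf("%s\n", ok ? "warp sum ok" : "warp sum failed");
    return ok ? 0 : 1;
}
```

Because the exchange is symmetric, no final broadcast is needed: all 32 lanes finish with the same total.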

shfl_sync

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
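Returns the value of var held by the thread whose lane ID is srcLane. The bits set in mask specify which lanes of the warp participate in the call.
In the example below, each thread derives its lane ID from threadIdx.x. The thread in lane 0 takes the target value directly from arg, and all other lanes obtain it as the return value of __shfl_sync().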

#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)        // Note unused variable for
        value = arg;        // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);   // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();

    return 0;
}
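
The signature above also takes an optional width argument, which the broadcast example leaves at its default. The following is a minimal sketch of how width changes the result, assuming one full warp of 32 threads. The names `segmentBcast` and `segmentBcastOk` are hypothetical. With width = 8 the warp is treated as four independent 8-lane segments, and srcLane is interpreted relative to the start of each segment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every lane publishes laneId * 10, then reads from srcLane 0 of its own
// 8-lane segment, i.e. from physical lane 0, 8, 16, or 24.
__global__ void segmentBcast(int *out) {
    int laneId = threadIdx.x & 0x1f;
    int value = laneId * 10;
    out[threadIdx.x] = __shfl_sync(0xffffffff, value, 0, 8);
}

// Hypothetical host helper: lane i should receive the value of the first
// lane in its segment, which is (i rounded down to a multiple of 8) * 10.
bool segmentBcastOk() {
    int *dev = nullptr;
    int host[32];
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    segmentBcast<<<1, 32>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < 32; ++i) {
        if (host[i] != (i & ~7) * 10) return false;
    }
    return true;
}

int main() {
    bool ok = segmentBcastOk();
    printf("%s\n", ok ? "segment broadcast ok" : "segment broadcast failed");
    return ok ? 0 : 1;
}
```

Lanes 0 to 7 receive 0, lanes 8 to 15 receive 80, lanes 16 to 23 receive 160, and lanes 24 to 31 receive 240.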
posted @ 2022-07-12 17:12  ijpq