shfl_*
shfl_xor
- CUDA docs: search for shfl_xor
- https://tschmidt23.github.io/cse599i/CSE 599 I Accelerated Computing - Programming GPUs Lecture 18.pdf
- https://people.maths.ox.ac.uk/gilesm/cuda/2019/lecture_04.pdf
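A common use of __shfl_xor_sync() is a butterfly reduction within a warp. The following is a minimal sketch of that pattern, not taken from the linked material: the kernel name warpSumXor, the helper runWarpSumXor, and the choice of each lane contributing its own lane ID are all assumptions made for illustration. In each step a lane adds the value held by the lane whose ID differs from its own in one bit (laneId XOR laneMask), so after five steps every lane should hold the sum of all 32 inputs.

```cuda
#include <stdio.h>

// Hypothetical kernel: each lane starts with its own lane ID and the
// warp sums these values with a butterfly (XOR) exchange pattern.
__global__ void warpSumXor(int *out) {
    int laneId = threadIdx.x & 0x1f;
    int value = laneId;
    // Partner lane is laneId ^ laneMask, for laneMask = 16, 8, 4, 2, 1.
    for (int laneMask = 16; laneMask >= 1; laneMask /= 2)
        value += __shfl_xor_sync(0xffffffff, value, laneMask);
    out[threadIdx.x] = value;  // every lane ends up with the full sum
}

// Hypothetical host helper: launches one warp and returns lane 0's result.
int runWarpSumXor() {
    int *out;
    cudaMallocManaged(&out, 32 * sizeof(int));
    warpSumXor<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    int result = out[0];
    cudaFree(out);
    return result;
}

int main() {
    // 0 + 1 + ... + 31 = 496
    printf("warp sum = %d\n", runWarpSumXor());
    return 0;
}
```

Unlike a reduction built on __shfl_down_sync(), where only lane 0 ends up with the total, the XOR pattern leaves the result in every lane, so no extra broadcast is needed afterwards.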
shfl_sync
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
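Returns the value of var held by the lane srcLane. The mask argument is a bitmask in which each set bit marks a lane that participates in the call.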
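In the example below, each thread in the warp is identified by its lane ID, derived from threadIdx.x. The thread in lane 0 takes the target value directly from arg, and all other lanes obtain it through __shfl_sync().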
#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)      // Note unused variable for
        value = arg;      // all threads except lane 0
    // Synchronize all threads in warp, and get "value" from lane 0
    value = __shfl_sync(0xffffffff, value, 0);
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
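The width parameter in the signature above is not exercised by the broadcast example. Here is a small sketch of what it does, assuming width = 8; the kernel name bcastWidth and the helper runBcastWidth are made up for illustration. With a width smaller than warpSize, the warp is split into independent groups of width lanes, and srcLane is interpreted relative to the start of each group, so srcLane = 0 means "the first lane of my own group".

```cuda
#include <stdio.h>

// Hypothetical kernel: every lane reads the lane ID of the first lane
// in its own group of 8 lanes.
__global__ void bcastWidth(int *out) {
    int laneId = threadIdx.x & 0x1f;
    // width = 8: lanes 0-7, 8-15, 16-23, 24-31 form separate groups,
    // and srcLane 0 refers to the first lane of each group.
    out[threadIdx.x] = __shfl_sync(0xffffffff, laneId, 0, 8);
}

// Hypothetical host helper: launches one warp and returns the value
// read by the given lane.
int runBcastWidth(int lane) {
    int *out;
    cudaMallocManaged(&out, 32 * sizeof(int));
    bcastWidth<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    int result = out[lane];
    cudaFree(out);
    return result;
}

int main() {
    // Lane 5 reads from lane 0, lane 13 reads from lane 8.
    printf("lane 5 -> %d, lane 13 -> %d\n", runBcastWidth(5), runBcastWidth(13));
    return 0;
}
```

According to the CUDA documentation, width must be a power of two no larger than warpSize.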
This article is from cnblogs (博客园), author: ijpq. When reposting, please cite the original link: https://www.cnblogs.com/ijpq/p/16470829.html