AMD OpenCL on GCN: Optimization Summary

Debug tools: CodeXL
Arch: GCN
  1 CU = 4 x SIMD (16 lanes each; a wavefront runs over 4 consecutive cycles) = 64 threads
  1 CU = 4 x VALU + 1 x SALU + LDS (max 32 KB per work-group) + 16 KB L1 cache
  1 wavefront = 64 threads
  max 256 VGPRs
  max 101 SGPRs
  inline constants: integers [-16, 64], floats [PI, -4.0, -2.0, -1.0, -0.5, ...]
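
The register limits above bound how many wavefronts a SIMD can keep in flight. A rough sketch of that arithmetic follows. The cap of 10 wavefronts per SIMD and the allocation granularity of 4 VGPRs are assumptions taken from commonly quoted GCN figures, not from these notes, and `waves_per_simd` is a hypothetical helper name.

```c
/* Sketch only: VGPR-limited wavefronts per SIMD on GCN.
   Assumptions (not stated in the notes above): at most 10 wavefronts
   per SIMD, and VGPRs are allocated in multiples of 4. */
#define VGPR_BUDGET        256u  /* max VGPRs per thread, from the notes */
#define MAX_WAVES_PER_SIMD 10u   /* assumption */
#define VGPR_GRANULARITY   4u    /* assumption */

static unsigned waves_per_simd(unsigned vgprs_per_thread)
{
    if (vgprs_per_thread == 0)
        return MAX_WAVES_PER_SIMD;

    /* Round the request up to the allocation granularity. */
    unsigned alloc = (vgprs_per_thread + VGPR_GRANULARITY - 1)
                     / VGPR_GRANULARITY * VGPR_GRANULARITY;
    if (alloc > VGPR_BUDGET)
        return 0;  /* the kernel does not fit */

    unsigned waves = VGPR_BUDGET / alloc;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under these assumptions a kernel using 24 VGPRs could keep 10 wavefronts per SIMD resident, while one using 65 would drop to 3, which is why reducing VGPR use (see readfirstlane in section 1) can help hide latency.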
Memory types:
  __private   registers
  __constant  uniform (read-only constant)
  __local     LDS
  __global    SSBO (shader storage buffer)
  __image
  flat memory (gfx9 and above)

 


1. Intrinsics / built-in functions (assembly-level optimization): use intrinsics to reach the hardware's special instructions
  ballot (EXEC mask)
  mbcnt: D = CountOneBits(S0 & ThreadMask[31:0]) + S1
  bcnt
  barycentric coordinate
  readfirstlane: converts a VGPR value to an SGPR, which can reduce VGPR pressure
  interp: vertex parameter interpolation
  swizzle
  permute (ds, ...
  fp16
  med3,min3,max3
  fma
  msad
  cube
  div_fixup
  ?trig_
  bit operations
  ......
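
To make the ballot and mbcnt entries concrete, here is a CPU-side emulation in plain C. It is a sketch of the semantics only, not the real intrinsics: `ballot`, `mbcnt_lo`, `mbcnt_hi` and `compact_index` are hypothetical names, and the wavefront is modelled as a 64-entry array. Combining the low and high halves gives each lane the number of active lanes below it, which is the usual slot index for stream compaction.

```c
#include <stdint.h>

#define WAVE_SIZE 64

/* Emulated ballot: bit i is set when lane i's predicate is non-zero. */
static uint64_t ballot(const int pred[WAVE_SIZE])
{
    uint64_t mask = 0;
    for (int lane = 0; lane < WAVE_SIZE; ++lane)
        if (pred[lane])
            mask |= (uint64_t)1 << lane;
    return mask;
}

/* Portable population count (what bcnt does for one 32-bit word). */
static uint32_t popcount32(uint32_t x)
{
    uint32_t n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* ThreadMask for a lane: one bit set for every lane with a lower id. */
static uint64_t thread_mask(int lane)
{
    return ((uint64_t)1 << lane) - 1;
}

/* D = CountOneBits(S0 & ThreadMask[31:0]) + S1 */
static uint32_t mbcnt_lo(uint32_t s0, uint32_t s1, int lane)
{
    return popcount32(s0 & (uint32_t)thread_mask(lane)) + s1;
}

/* Same operation over the upper half, ThreadMask[63:32]. */
static uint32_t mbcnt_hi(uint32_t s0, uint32_t s1, int lane)
{
    return popcount32(s0 & (uint32_t)(thread_mask(lane) >> 32)) + s1;
}

/* Number of active lanes below `lane` in a 64-bit ballot mask. */
static uint32_t compact_index(uint64_t mask, int lane)
{
    uint32_t lo = mbcnt_lo((uint32_t)mask, 0, lane);
    return mbcnt_hi((uint32_t)(mask >> 32), lo, lane);
}
```

On the GPU the same result comes from two instructions per lane with no loop, which is the point of using the intrinsic instead of writing the count by hand.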

 
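2. Instruction latency

Floating-point instructions have much lower latency than integer instructions, so there is no need for 'integer quantization' optimizations.

Taking the Trinity APU as an example, FP add latency on the GPU is 5.7x that of the CPU, while int add latency on the GPU is 45.3x that of the CPU.
Memory latency is also somewhat higher on the GPU than on the CPU.
Branch latency is especially high.
Launch latency is high, because a launch involves uploading the kernel and buffers, setting up CU registers, and so on. (If a game compiled all of its shaders together, or into separately uploadable subroutines, would launches be much faster?)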

3. Channel, rank, and bank conflicts
Avoid channel and bank conflicts. Memory is best laid out in tiles, with one work-group per tile.
Specify the default local_size:
__attribute__((reqd_work_group_size(X, Y, Z)))
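
The bank-conflict advice can be checked with a little arithmetic. The sketch below assumes LDS is organised as 32 banks of 4 bytes each (the figure usually quoted for GCN, not stated in these notes) and counts how many lanes land on the same bank when each lane reads column 0 of its own row in a row-major float tile. `lds_bank` and `max_conflict` are hypothetical helper names.

```c
#include <stdint.h>

#define LDS_BANKS        32u  /* assumption: 32 LDS banks */
#define BANK_WIDTH_BYTES 4u   /* assumption: each bank is 4 bytes wide */

/* Bank that a byte address in LDS maps to. */
static unsigned lds_bank(uint32_t byte_addr)
{
    return (byte_addr / BANK_WIDTH_BYTES) % LDS_BANKS;
}

/* Worst-case number of lanes hitting one bank when lane i reads
   element [i][0] of a float tile whose rows are pitch_floats wide. */
static unsigned max_conflict(unsigned lanes, unsigned pitch_floats)
{
    unsigned hits[LDS_BANKS] = {0};
    unsigned worst = 0;

    for (unsigned lane = 0; lane < lanes; ++lane) {
        uint32_t addr = lane * pitch_floats * BANK_WIDTH_BYTES;
        unsigned bank = lds_bank(addr);
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

With a pitch of 32 floats every lane maps to bank 0, so the access serialises; padding each row to 33 floats spreads the same 32 lanes over 32 different banks. This is one reason tile dimensions and local_size are worth fixing together.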

4. Branch elimination
   GPU branches are very inefficient; eliminate jumps with select instructions and loop unrolling.
  . loop unrolling
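
A minimal sketch of both ideas in plain C follows. In an OpenCL kernel the `select` built-in or a ternary would normally be used; the hypothetical `select_u32` helper below only mimics its scalar argument order (result is `b` when the condition holds, otherwise `a`) so the example can be compiled and tested on a CPU. `sum_above` is likewise a made-up example function.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-free select: returns b when cond is non-zero, otherwise a. */
static uint32_t select_u32(uint32_t a, uint32_t b, int cond)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);  /* all ones or zero */
    return (b & mask) | (a & ~mask);
}

/* Sum of the elements greater than threshold, with the loop unrolled
   by 4 and no data-dependent branch in the loop body. */
static uint32_t sum_above(const uint32_t *v, size_t n, uint32_t threshold)
{
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        sum += select_u32(0, v[i],     v[i]     > threshold);
        sum += select_u32(0, v[i + 1], v[i + 1] > threshold);
        sum += select_u32(0, v[i + 2], v[i + 2] > threshold);
        sum += select_u32(0, v[i + 3], v[i + 3] > threshold);
    }
    for (; i < n; ++i)  /* remainder when n is not a multiple of 4 */
        sum += select_u32(0, v[i], v[i] > threshold);

    return sum;
}
```

Every lane executes the same instruction stream regardless of its data, so the wavefront does not have to run both sides of a divergent branch.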
5. Bit operations:
     bfm ballot mbcnt bfe bitcmp wqm quadmask ff0 ff1 flbit bitset0 bitset1 saveexec
6. Registers and local variables
   Do not declare data at global scope; declare temporaries locally.


Example:
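1. Measure kernel time

/* Profiling has to be enabled on the queue when it is created. */
cl_queue_properties properties[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
queue = clCreateCommandQueueWithProperties(context, device, properties, &err);
clEnqueueNDRangeKernel(..., &event);
clWaitForEvents(1, &event);

/* startTime and endTime are cl_ulong timestamps in nanoseconds. */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, &length);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endTime, &length);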
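
The two timestamps are device times in nanoseconds, so the kernel's execution time is their difference. A small sketch of the conversion, where `elapsed_ms` is a hypothetical helper and `uint64_t` stands in for `cl_ulong` so the snippet compiles without the OpenCL headers:

```c
#include <stdint.h>

/* Elapsed time in milliseconds between two profiling timestamps.
   uint64_t stands in for cl_ulong; both values are in nanoseconds. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}
```

Usage would be `elapsed_ms(startTime, endTime)` after the two queries above. Since launch latency is high (section 2), averaging over several runs probably gives a more representative figure than a single launch.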

2. integration Histogram
   http://blog.csdn.net/10km/article/details/51610735
   http://www.doc88.com/p-9933630209054.html

REFERENCE:
      http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472054

 

posted @ 2017-01-09 19:15  neophyte