Vulkan asynchronous compute

https://www.youtube.com/watch?v=XOGIDMJThto

https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf

 

https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization

https://gpuopen.com/concurrent-execution-asynchronous-queues/

Parallelism across queues increases parallelism inside the GPU.

 

Concurrency

Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each of those containing 4 Single-Instruction-Multiple-Data units (SIMD) and each SIMD executes blocks of 64 threads, which we call a “wavefront”.

Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.

The GPU has 64 CUs.

Each CU contains 4 SIMDs.

Each SIMD executes blocks of 64 threads ----- one such block is a wavefront (up to 10 can be resident per SIMD to hide memory latency).

Pixel shader work runs inside these wavefronts.

Raising concurrency on the GPU reduces GPU idle time.
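As a quick sanity check of the figures quoted above, here is a minimal sketch of the arithmetic (Fury X numbers taken from the GPUOpen article, not queried from any driver):

```cpp
#include <cstdio>

int main() {
    constexpr int computeUnits        = 64; // CUs on the Fury X
    constexpr int simdsPerCU          = 4;  // SIMD units per CU
    constexpr int wavefrontsPerSIMD   = 10; // max resident wavefronts per SIMD
    constexpr int threadsPerWavefront = 64;

    // 64 * 4 * 10 * 64 = 163,840 threads can be resident at once;
    // switching between these wavefronts is what hides memory latency.
    constexpr int maxResidentThreads =
        computeUnits * simdsPerCU * wavefrontsPerSIMD * threadsPerWavefront;

    std::printf("max resident threads: %d\n", maxResidentThreads);
}
```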

async compute

  • Copy Queue (DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
  • Compute Queue (DirectX 12 and Vulkan): executes compute shaders or copies data, preferably within local memory
  • Direct Queue (DirectX 12) / Graphics Queue (Vulkan): this queue can do anything, so it is similar to the main device in legacy APIs

These three queue types correspond to the three encoder types in Metal; they exist to increase the concurrency described above.
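In Vulkan these queue types surface as queue families. A minimal sketch of enumerating them and spotting a compute-only or transfer-only family (assuming a valid VkPhysicalDevice `phys`; which families exist, and how many queues each exposes, varies per driver):

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

void listQueueFamilies(VkPhysicalDevice phys) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags f = families[i].queueFlags;
        // A compute-capable family without VK_QUEUE_GRAPHICS_BIT is the usual
        // async compute queue; a transfer-only family maps to the DMA engine.
        std::printf("family %u: queueCount=%u graphics=%d compute=%d transfer=%d\n",
                    i, families[i].queueCount,
                    (f & VK_QUEUE_GRAPHICS_BIT) != 0,
                    (f & VK_QUEUE_COMPUTE_BIT) != 0,
                    (f & VK_QUEUE_TRANSFER_BIT) != 0);
    }
}
```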

 

The ability to drive the GPU at this low level is exposed through these queues.

Vulkan limits the number of queues per queue family; the limit can be queried (the queueCount field in the sketch above).

DirectX 12 has no such limit on the number of queues.
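A minimal sketch of requesting queues within those queried limits at device creation. `graphicsFamily` and `computeFamily` are assumed to come from the enumeration above and to be two different families; each queueCount requested here must not exceed the queueCount reported for that family:

```cpp
#include <vulkan/vulkan.h>

VkDevice createDevice(VkPhysicalDevice phys,
                      uint32_t graphicsFamily, uint32_t computeFamily) {
    const float priority = 1.0f;

    // One queue from the graphics family, one from the (async) compute family.
    VkDeviceQueueCreateInfo queues[2] = {};
    queues[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queues[0].queueFamilyIndex = graphicsFamily;
    queues[0].queueCount       = 1; // must stay <= the family's queried queueCount
    queues[0].pQueuePriorities = &priority;
    queues[1]                  = queues[0];
    queues[1].queueFamilyIndex = computeFamily;

    VkDeviceCreateInfo info = {};
    info.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    info.queueCreateInfoCount = 2;
    info.pQueueCreateInfos    = queues;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(phys, &info, nullptr, &device);
    return device;
}
```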

Pull more of the frame out into compute shaders so it can run as async compute.

See the diagram (a skill point I haven't unlocked yet).

 

troubleshooting

  • If resources are located in system memory, accessing them from the Graphics or Compute queues will have an impact on DMA queue performance, and vice versa.
  • Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations, so keep data on-chip where possible.
  • Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads from executing on the same CU.
  • Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance.

Due to the reasons above it is recommended to determine bottlenecks for each pass and place passes with complementary bottlenecks next to each other:

  • Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
  • Depth-only rendering passes are usually good candidates to have some compute tasks run alongside them
  • A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1 (sketched after this list)
  • Porting as much of the frame as possible to compute will result in more flexibility when experimenting with which tasks can be scheduled next to each other
  • Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of “for each light clear shadow map, render shadow, compute VSM” do “clear all shadow maps, render all shadow maps, compute VSM for all shadow maps”)
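To make the frame-overlap idea concrete, a minimal sketch of submitting frame N's post-processing to the compute queue while frame N+1's shadow maps render on the graphics queue. The command buffers and semaphores (postProcessN, shadowsN1, sceneDone, postDone) are hypothetical names for illustration:

```cpp
#include <vulkan/vulkan.h>

void submitOverlapped(VkQueue graphicsQueue, VkQueue computeQueue,
                      VkCommandBuffer postProcessN, // frame N post-processing
                      VkCommandBuffer shadowsN1,    // frame N+1 shadow maps
                      VkSemaphore sceneDone,        // signaled when frame N's scene is rendered
                      VkSemaphore postDone)         // signaled when post-processing finishes
{
    // Frame N post-processing waits for the scene, then runs on the compute queue.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkSubmitInfo post = {};
    post.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    post.waitSemaphoreCount   = 1;
    post.pWaitSemaphores      = &sceneDone;
    post.pWaitDstStageMask    = &waitStage;
    post.commandBufferCount   = 1;
    post.pCommandBuffers      = &postProcessN;
    post.signalSemaphoreCount = 1;
    post.pSignalSemaphores    = &postDone;
    vkQueueSubmit(computeQueue, 1, &post, VK_NULL_HANDLE);

    // Frame N+1 shadow rendering has no dependency on frame N's post pass,
    // so it can start right away on the graphics queue and overlap with it.
    VkSubmitInfo shadows = {};
    shadows.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    shadows.commandBufferCount = 1;
    shadows.pCommandBuffers    = &shadowsN1;
    vkQueueSubmit(graphicsQueue, 1, &shadows, VK_NULL_HANDLE);
}
```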

Then put the async compute path behind a switch so it can be toggled on and off.
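A minimal sketch of what such a switch could look like; the Queues struct and its field names are hypothetical. With the switch off (or on a device with no compute-only family), the same command buffers simply go to the graphics queue, which makes A/B profiling straightforward:

```cpp
#include <vulkan/vulkan.h>

struct Queues {
    VkQueue graphics            = VK_NULL_HANDLE;
    VkQueue compute             = VK_NULL_HANDLE; // may be absent on some devices
    bool    asyncComputeEnabled = false;          // the runtime switch
};

// Returns the queue that compute work should be submitted to.
VkQueue pickComputeSubmitQueue(const Queues& q) {
    return (q.asyncComputeEnabled && q.compute != VK_NULL_HANDLE)
               ? q.compute
               : q.graphics;
}
```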

Judging from what Vulkan exposes, it does not seem to have anything like Metal 2's persistent thread groups, which can keep data on-tile while it is passed between the CS and the PS.

 

posted on 2019-10-08 16:47 minggoddess