Life of a triangle - NVIDIA's logical pipeline
Almost five years have gone by since the release of the ground-breaking Fermi architecture, so it might be time to refresh the principal graphics architecture beneath it. Fermi was the first NVIDIA GPU implementing a fully scalable graphics engine, and its core architecture can be found in Kepler as well as Maxwell. The following article, and especially the “compressed pipeline knowledge” image below, should serve as a primer based on the various public materials, such as whitepapers or GTC tutorials about the GPU architecture. This article focuses on the graphics viewpoint of how the GPU works, although some principles, such as how shader program code gets executed, are the same for compute.
Pipeline Architecture Image
GPUs are super parallel work distributors
Why all this complexity? In graphics we have to deal with data amplification that creates lots of variable workloads. Each drawcall may generate a different number of triangles. The number of vertices after clipping is different from the number our triangles were originally made of. After back-face and depth culling, not all triangles may need pixels on the screen. The screen size of a triangle can mean it requires millions of pixels or none at all.
As a consequence, modern GPUs let their primitives (triangles, lines, points) follow a logical pipeline, not a physical pipeline. In the old days before G80's unified architecture (think DX9 hardware, PS3, Xbox 360), the pipeline was represented on the chip with the different stages, and work would run through it one after another. G80 essentially reused some units for both vertex and fragment shader computations, depending on the load, but it still had a serial process for the primitives/rasterization and so on. With Fermi the pipeline became fully parallel, which means the chip implements a logical pipeline (the steps a triangle goes through) by reusing multiple engines on the chip.
Let's say we have two triangles A and B. Parts of their work could be in different logical pipeline steps.
A has already been transformed and needs to be rasterized. Some of its pixels could be running pixel-shader instructions already, while others are being rejected by depth-buffer (Z-cull), others could be already being written to framebuffer, and some may actually wait.
And next to all that, we could be fetching the vertices of triangle B.
So while each triangle has to go through the logical steps, lots of them could be actively processed at different steps of their lifetime. The job (get the drawcall's triangles on screen) is split into many smaller tasks and even subtasks that can run in parallel. Each task is scheduled to the resources that are available, which is not limited to tasks of a certain type (vertex-shading parallel to pixel-shading).
Think of a river that fans out: parallel pipeline streams, independent of each other, each on its own timeline, some branching more than others. If we color-coded the units of a GPU based on the triangle, or drawcall, each is currently working on, it would be multi-color blinkenlights :)
GPU architecture
Since Fermi, NVIDIA has used a similar principal architecture. There is a Giga Thread Engine which manages all the work that's going on. The GPU is partitioned into multiple GPCs (Graphics Processing Clusters), each of which has multiple SMs (Streaming Multiprocessors) and one Raster Engine. There are lots of interconnects in this process, most notably a Crossbar that allows work migration across GPCs or other functional units like the ROP (render output unit) subsystems.
The work that a programmer thinks of (shader program execution) is done on the SMs. An SM contains many Cores which do the math operations for the threads. One thread could be a vertex- or pixel-shader invocation, for example. Those cores and other units are driven by Warp Schedulers, which manage a group of 32 threads as a warp and hand over the instructions to be performed to Dispatch Units. The code logic is handled by the scheduler and not inside a core itself, which just sees something like "sum register 4234 with register 4235 and store in 4230" from the dispatcher. A core itself is rather dumb, compared to a CPU where a core is pretty smart. The GPU puts the smartness into higher levels; it conducts the work of an entire ensemble (or multiple ensembles, if you will).
How many of these units are actually on the GPU (how many SMs per GPC, how many GPCs...) depends on the chip configuration itself. As you can see above, GM204 has 4 GPCs, each with 4 SMs, but Tegra X1 for example has 1 GPC and 2 SMs, both with the Maxwell design. The SM design itself (number of cores, instruction units, schedulers...) has also changed over time from generation to generation (see first image) and helped make the chips so efficient that they can be scaled from high-end desktop to notebook to mobile.
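This per-level scaling is just multiplication of the unit counts. A minimal sketch, assuming the publicly documented figure of 128 CUDA cores per Maxwell SM (the GPC/SM counts are the ones mentioned above):

```python
# Rough core-count arithmetic for Maxwell-class chips.
# Assumes 128 CUDA cores per SM (the public Maxwell figure).
def total_cores(gpcs, sms_per_gpc, cores_per_sm=128):
    return gpcs * sms_per_gpc * cores_per_sm

gm204 = total_cores(gpcs=4, sms_per_gpc=4)     # 4 GPCs x 4 SMs
tegra_x1 = total_cores(gpcs=1, sms_per_gpc=2)  # 1 GPC  x 2 SMs
print(gm204, tegra_x1)  # 2048 256
```

The same SM building block scales from 256 cores on a mobile chip to 2048 on a desktop part just by changing these two counts.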
The logical pipeline
For the sake of simplicity, several details are omitted. We assume the drawcall references some index- and vertexbuffer that is already filled with data and lives in the DRAM of the GPU, and that it uses only a vertex- and pixelshader (GL: fragmentshader).
- The program makes a drawcall in the graphics api (DX or GL). This reaches the driver at some point which does a bit of validation to check if things are "legal" and inserts the command in a GPU-readable encoding inside a pushbuffer. A lot of bottlenecks can happen here on the CPU side of things, which is why it is important programmers use apis well, and techniques that leverage the power of today's GPUs.
- After a while or explicit "flush" calls, the driver has buffered up enough work in a pushbuffer and sends it to be processed by the GPU (with some involvement of the OS). The Host Interface of the GPU picks up the commands which are processed via the Front End.
- We start our work distribution in the Primitive Distributor by processing the indices in the indexbuffer and generating triangle work batches that we send out to multiple GPCs.
- Within a GPC, the Poly Morph Engine of one of the SMs takes care of fetching the vertex data from the triangle indices (Vertex Fetch).
- After the data has been fetched, warps of 32 threads are scheduled inside the SM and will be working on the vertices.
- The SM's warp scheduler issues the instructions for the entire warp in-order. The threads run each instruction in lock-step and can be masked out individually if they should not actively execute it. There can be multiple reasons for requiring such masking: for example, when the current instruction is part of the "if (true)" branch and the thread-specific data evaluated to "false", or when a loop's termination criterion was reached in one thread but not another. Therefore having lots of branch divergence in a shader can increase the time spent for all threads in the warp significantly. Threads cannot advance individually, only as a warp! Warps, however, are independent of each other.
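A toy model of this lock-step execution (illustrative Python, not real hardware behavior): the scheduler issues every instruction of both branch sides to the whole warp, and a per-lane mask decides who actually writes a result.

```python
# Toy lock-step warp model: one instruction stream, per-lane active mask.
# Both sides of a divergent branch are issued to all 32 lanes; inactive
# lanes are masked out, so divergence costs the sum of both paths.
WARP_SIZE = 32

def run_warp(data, then_ops, else_ops, cond):
    issued = 0
    mask_then = [cond(x) for x in data]   # lanes taking the "if" side
    for op in then_ops:                   # issued to the whole warp
        issued += 1
        data = [op(x) if m else x for x, m in zip(data, mask_then)]
    for op in else_ops:                   # also issued to the whole warp
        issued += 1
        data = [x if m else op(x) for x, m in zip(data, mask_then)]
    return data, issued

lanes = list(range(WARP_SIZE))
out, issued = run_warp(lanes,
                       then_ops=[lambda x: x * 2],
                       else_ops=[lambda x: x + 100],
                       cond=lambda x: x < 16)
# Each lane only executed one side, yet the warp still issued
# len(then_ops) + len(else_ops) = 2 instructions.
print(issued)  # 2
```

Extend either branch and the issue count grows for every lane in the warp, which is exactly why heavy divergence hurts.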
- The warp's instruction may be completed at once or may take several dispatch turns. For example, the SM typically has fewer units for load/store than for basic math operations.
- As some instructions take longer to complete than others, especially memory loads, the warp scheduler may simply switch to another warp that is not waiting for memory. This is the key concept of how GPUs overcome the latency of memory reads: they simply switch out groups of active threads. To make this switching very fast, all threads managed by the scheduler have their own registers in the register file. The more registers a shader program needs, the less space there is for threads/warps. The fewer warps we can switch between, the less useful work we can do while waiting for instructions to complete (foremost memory fetches).
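The register/warp trade-off can be sketched with simple arithmetic. The 65536 32-bit registers and 64-warp cap per SM used here are the public Maxwell figures; real allocation granularity is coarser than this:

```python
# Sketch of how register pressure limits resident warps on an SM.
# 65536 registers and a 64-warp cap per SM are public Maxwell figures.
REGISTERS_PER_SM = 65536
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32

def resident_warps(regs_per_thread):
    warps_by_regs = REGISTERS_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(warps_by_regs, MAX_WARPS_PER_SM)

print(resident_warps(32))   # 64 -> plenty of warps to hide latency
print(resident_warps(128))  # 16 -> far fewer candidates to switch to
```

A shader that quadruples its register use here cuts the number of latency-hiding warps by four.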
- Once the warp has completed all instructions of the vertex-shader, its results are processed by the Viewport Transform. The triangle gets clipped by the clip-space volume and is ready for rasterization. We use L1 and L2 caches for all this cross-task communication data.
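For reference, the viewport transform itself is just the perspective divide followed by a scale and bias into window coordinates. A minimal sketch using the GL convention (viewport given as x, y, width, height):

```python
# Minimal viewport transform sketch: clip space -> NDC -> window
# coordinates, following the GL convention.
def viewport_transform(clip, vp_x, vp_y, vp_w, vp_h):
    x, y, z, w = clip
    ndc_x, ndc_y = x / w, y / w                 # perspective divide
    win_x = vp_x + (ndc_x * 0.5 + 0.5) * vp_w   # [-1,1] -> viewport
    win_y = vp_y + (ndc_y * 0.5 + 0.5) * vp_h
    return win_x, win_y

# A clip-space vertex at the center lands in the viewport center.
print(viewport_transform((0.0, 0.0, 0.0, 1.0), 0, 0, 1920, 1080))  # (960.0, 540.0)
```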
- Now it gets exciting: our triangle is about to be chopped up and will potentially leave the GPC it currently lives on. The bounding box of the triangle is used to decide which raster engines need to work on it, as each engine covers multiple tiles of the screen. It sends out the triangle to one or multiple GPCs via the Work Distribution Crossbar. We effectively split our triangle into lots of smaller jobs now.
Summary:
vertices
--Primitive Distributor--> triangle work batches -> sent to the GPCs for processing
--> after processing, through the Work Distribution Crossbar -> the triangle's bounding box determines which GPC(s) receive it for rasterization (possibly crossing GPCs)
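The bounding-box test can be sketched as follows. The 16-pixel tile size and the idea of enumerating covered tiles are illustrative assumptions; the real tile size and tile-to-raster-engine mapping are not public:

```python
# Which screen tiles does a triangle's bounding box touch?
# TILE is a made-up tile edge length for illustration only.
TILE = 16

def covered_tiles(verts):
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    x0, x1 = int(min(xs)) // TILE, int(max(xs)) // TILE
    y0, y1 = int(min(ys)) // TILE, int(max(ys)) // TILE
    return [(tx, ty) for ty in range(y0, y1 + 1)
                     for tx in range(x0, x1 + 1)]

tri = [(10.0, 10.0), (40.0, 12.0), (20.0, 30.0)]
print(covered_tiles(tri))  # six tiles: (0,0)..(2,1)
```

Every raster engine owning one of these tiles gets a copy of the triangle, which is how one triangle becomes many smaller jobs.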
- Attribute Setup at the target SM will ensure that the interpolants (for example the outputs we generated in a vertex-shader) are in a pixel shader friendly format.
- The Raster Engine of a GPC works on the triangle it received and generates the pixel information for those sections that it is responsible for (also handles back-face culling and Z-cull).
- Again we batch up 32 pixel threads, or better said 8 times 2x2 pixel quads, which is the smallest unit we will always work with in pixel shaders. This 2x2 quad allows us to calculate derivatives for things like texture mip-map filtering (a big change in texture coordinates within the quad causes a higher mip). Those threads within the 2x2 quad whose sample locations are not actually covering the triangle are masked out (gl_HelperInvocation). One of the local SM's warp schedulers will manage the pixel-shading task.
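The quad-derivative-to-mip relationship can be sketched in a few lines. This is the simplified isotropic form of LOD selection (footprint via finite differences across the quad, then log2), not the exact hardware formula:

```python
import math

# Mip selection from a 2x2 pixel quad, simplified: finite differences
# of the texture coordinate across the quad give the texel footprint,
# and log2 of that footprint picks the mip level (isotropic case).
def mip_level(quad_uv, tex_size):
    # quad_uv: [[uv00, uv10], [uv01, uv11]] texcoords of the 2x2 quad
    (u00, v00), (u10, v10) = quad_uv[0]
    (u01, v01), _ = quad_uv[1]
    dudx, dvdx = (u10 - u00) * tex_size, (v10 - v00) * tex_size
    dudy, dvdy = (u01 - u00) * tex_size, (v01 - v00) * tex_size
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    return max(0.0, math.log2(rho))

# One texel step per pixel -> mip 0; four texels per pixel -> mip 2.
print(mip_level([[(0.0, 0.0), (1/256, 0.0)], [(0.0, 1/256), (1/256, 1/256)]], 256))
print(mip_level([[(0.0, 0.0), (4/256, 0.0)], [(0.0, 4/256), (4/256, 4/256)]], 256))
```

This is also why the helper invocations exist: even masked-out lanes must run the shader so the finite differences have valid neighbors.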
- The same warp scheduler instruction game, that we had in the vertex-shader logical stage, is now performed on the pixel-shader threads. The lock-step processing is particularly handy because we can access the values within a pixel quad almost for free, as all threads are guaranteed to have their data computed up to the same instruction point (NV_shader_thread_group).
- Are we there yet? Almost. Our pixel-shader has completed the calculation of the colors to be written to the rendertargets, and we also have a depth value. At this point we have to take the original api ordering of triangles into account before we hand that data over to one of the ROP (render output unit) subsystems, which in itself has multiple ROP units. Here depth-testing, blending with the framebuffer and so on is performed. These operations need to happen atomically (one color/depth set at a time) to ensure we don't have one triangle's color and another triangle's depth value when both cover the same pixel. NVIDIA typically applies memory compression to reduce memory bandwidth requirements, which increases "effective" bandwidth (see GTX 980 pdf).
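Functionally, the ROP step boils down to a per-pixel depth test followed by an optional blend, applied one fragment at a time in api order. A toy sketch (a LESS depth test and a caller-supplied blend function are assumptions for illustration):

```python
# Toy ROP: per-pixel depth test, then optional blend, applied in api
# order. Framebuffer entries are (color, depth); smaller depth wins.
def rop_write(fb, x, y, src_color, src_depth, blend=None):
    color, depth = fb[(x, y)]
    if src_depth >= depth:          # depth test (LESS) fails
        return
    if blend:                       # e.g. alpha blending with dst color
        src_color = blend(src_color, color)
    fb[(x, y)] = (src_color, src_depth)

fb = {(0, 0): ((0, 0, 0), 1.0)}            # cleared pixel: black, far depth
rop_write(fb, 0, 0, (255, 0, 0), 0.5)      # red triangle, closer
rop_write(fb, 0, 0, (0, 255, 0), 0.7)      # green triangle, behind red
print(fb[(0, 0)])  # ((255, 0, 0), 0.5) - red survives
```

Because both writes touch the same pixel, the read-test-write sequence must be atomic per pixel, which is exactly the guarantee the ROP provides in hardware.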
Puh! We are done, we have written some pixels into a rendertarget. I hope this information was helpful for understanding some of the work/data flow within a GPU. It may also help explain another side-effect: why synchronization with the CPU is really hurtful. One has to wait until everything is finished and no new work is submitted (all units become idle); that means when sending new work, it takes a while until everything is fully under load again, especially on the big GPUs.
In the image below you can see how we rendered a CAD model and colored it by the different SMs or warp ids that contributed to the image (NV_shader_thread_group). The result would not be frame-coherent, as the work distribution will vary frame to frame. The scene was rendered using many drawcalls, of which several may also be processed in parallel (using NSIGHT you can see some of that drawcall parallelism as well).