ComputeShader基础用法系列之四
这次接着上一篇ComputeShader基础用法系列之三来继续说。上一节说到了要通过Compute Shader进行GPU Culling。
为什么需要GPU Culling呢?使用GPU Culling能带来什么好处?
传统意义上的culling是通过相机的Cull进行的,Camera.Cull所带来的性能问题随着场景的复杂程度提高而会越来越严重。那么我们能否将Cull放到GPU来做呢,利用GPU的高并行处理机制达到转移CPU压力。
答案当然是可以的,但是像CameraCulling一样,GPU Culling同样需要包围盒数据,这就意味着需要传入数据到GPU内存。所以我们能推出以下的方法:
1.将包围盒数据通过ComputeBuffer传入GPU
2.在ComputeShader中进行Culling操作
3. 通过DrawIndirect的方式将物体绘制出来。
这里为什么要用DrawIndirect的呢?DrawIndirect是什么呢?我们来看一下:
这个方法前两个步骤都没有问题,但是第三个步骤回读CPU是个大问题,我们知道CPU和GPU之间的传输带宽在手机上是非常有限的,如果大量GPU数据回读CPU,手机上必然是难以承受的。而且还有个问题在于这样做只是确定可以把视锥外的物体Renderer禁用,但是视锥内的这些物体还是要再走一遍相机裁减,这样的话两遍裁减两边都占用性能,体验简直不要太差。 通过在PC上profiler我们可以看到直接回读cpu culling结果的问题:
Camera Culling也在执行,Gpu Culling也在执行,而且注意等待GPU返回数据这一步,相当的耗时。
关于回读CPU的代码我就不往外面粘贴了,没什么参考意义,只是用来看看回读究竟多耗性能。那么接下来我们的主角:DrawIndirect就登场了。
Graphics.DrawMeshInstancedIndirect 这个方法主要是把在显存里面的数据直接Draw到渲染管线中,而不是传统的从CPU发送数据,通过这个接口,我们就可以直接把GPU Culling的结果放到渲染管线中执行,而无需回读CPU,也可以绕过CameraCulling机制。
我们首先来看官方对于这个API的讲解:https://docs.unity3d.com/ScriptReference/Graphics.DrawMeshInstancedIndirect.html
大家可以把代码直接copy到Unity工程查看一下效果。满屏幕的小方块:
官方这个例子只是告诉我们这个API如何使用,但是并没有做Culling操作。这就会导致很多不需要Draw的信息被放入了管线中处理。
跟着官方的例子,学会使用这个接口后,就直接上代码:
代码时基于官方提供的例子进行了一点点修改:
using System.Collections; using System.Collections.Generic; using UnityEngine; public class DrawIndirectCulled : MonoBehaviour { public struct ObjInfo { public Vector3 boundMin; public Vector3 boundMax; public Matrix4x4 localToWorldMatrix; public Matrix4x4 worldToLocalMatrix; } public struct MatrixInfo { public Matrix4x4 localToWorldMatrix; public Matrix4x4 worldToLocalMatrix; } public int instanceCount = 100000; public Mesh instanceMesh; public Material instanceMaterial; public int subMeshIndex = 0; public ComputeShader compute; private int cachedInstanceCount = -1; private int cachedSubMeshIndex = -1; private ComputeBuffer positionBuffer; private ComputeBuffer argsBuffer; private ComputeBuffer cullResult; List<ObjInfo> infos = new List<ObjInfo>(); private uint[] args = new uint[5] { 0, 0, 0, 0, 0 }; private int kernel; private int visibleCount; void Start() { kernel = compute.FindKernel("CSMain"); argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments); cullResult = new ComputeBuffer(instanceCount, sizeof(float)*32, ComputeBufferType.Append); UpdateBuffers(); } void Update() { // Update starting position buffer if (cachedInstanceCount != instanceCount || cachedSubMeshIndex != subMeshIndex) UpdateBuffers(); var camera = Camera.main; var vpMatrix = GL.GetGPUProjectionMatrix(camera.projectionMatrix,false) * camera.worldToCameraMatrix; compute.SetMatrix("vpMatrix", vpMatrix); positionBuffer.SetData(infos); compute.SetBuffer(kernel, "input", positionBuffer); cullResult.SetCounterValue(0); compute.SetBuffer(kernel, "cullresult", cullResult); compute.SetInt("instanceCount", instanceCount); compute.SetInt("visibleCount", 0); compute.Dispatch(kernel, instanceCount / 64, 1, 1); instanceMaterial.SetBuffer("positionBuffer", cullResult); // Indirect args if (instanceMesh != null) { args[0] = (uint)instanceMesh.GetIndexCount(subMeshIndex); args[1] = (uint)instanceCount; args[2] = (uint)instanceMesh.GetIndexStart(subMeshIndex); args[3] = (uint)instanceMesh.GetBaseVertex(subMeshIndex); } else { args[0] = args[1] = args[2] = args[3] = 0; } argsBuffer.SetData(args); // Pad input if (Input.GetAxisRaw("Horizontal") != 0.0f) instanceCount = (int)Mathf.Clamp(instanceCount + Input.GetAxis("Horizontal") * 40000, 1.0f, 5000000.0f); // Render Graphics.DrawMeshInstancedIndirect(instanceMesh, subMeshIndex, instanceMaterial, new Bounds(Vector3.zero, new Vector3(100.0f, 100.0f, 100.0f)), argsBuffer); } void OnGUI() { GUI.Label(new Rect(265, 25, 200, 30), "Instance Count: " + instanceCount.ToString()); instanceCount = (int)GUI.HorizontalSlider(new Rect(25, 20, 200, 30), (float)instanceCount, 1.0f, 5000000.0f); } void UpdateBuffers() { // Ensure submesh index is in range if (instanceMesh != null) subMeshIndex = Mathf.Clamp(subMeshIndex, 0, instanceMesh.subMeshCount - 1); // Positions if (positionBuffer != null) positionBuffer.Release(); positionBuffer = new ComputeBuffer(instanceCount, 152); infos.Clear(); Vector4[] positions = new Vector4[instanceCount]; for (int i = 0; i < instanceCount; i++) { ObjInfo info = default; float angle = Random.Range(0.0f, Mathf.PI * 2.0f); float distance = Random.Range(20.0f, 100.0f); float height = Random.Range(-2.0f, 2.0f); float size = Random.Range(0.05f, 0.25f); var position = new Vector3(Mathf.Sin(angle) * distance, height, Mathf.Cos(angle) * distance); info.boundMin = position - new Vector3(0.5f, 0.5f, 0.5f); info.boundMax = position + new Vector3(0.5f, 0.5f, 0.5f); info.localToWorldMatrix = Matrix4x4.TRS(position, Quaternion.identity, Vector3.one); info.worldToLocalMatrix = Matrix4x4.Inverse(info.localToWorldMatrix); infos.Add(info); } cachedInstanceCount = instanceCount; cachedSubMeshIndex = subMeshIndex; } void OnDestroy() { if (positionBuffer != null) positionBuffer.Release(); positionBuffer = null; if (argsBuffer != null) argsBuffer.Release(); argsBuffer = null; if (cullResult != null) cullResult.Release(); cullResult = null; } }
compute shader代码如下:
// Each #kernel tells which function to compile; you can have many kernels #pragma kernel CSMain struct ObjInfo { float3 boundMin; float3 boundMax; float4x4 localToWorldMatrix; float4x4 worldToLocalMatrix; }; struct MatrixInfo { float4x4 localToWorldMatrix; float4x4 worldToLocalMatrix; }; uint instanceCount; // Create a RenderTexture with enableRandomWrite flag and set it // with cs.SetTexture float4x4 vpMatrix; StructuredBuffer<ObjInfo> input; AppendStructuredBuffer<MatrixInfo> cullresult; [numthreads(64,1,1)] void CSMain (uint3 id : SV_DispatchThreadID) { if(instanceCount<=id.x) return; ObjInfo info = input[id.x]; float3 boundMax = info.boundMax; float3 boundMin = info.boundMin; float4 boundVerts[8]; float4x4 mvpMatrix = mul(vpMatrix,info.localToWorldMatrix); boundVerts[0] = mul(mvpMatrix, float4(boundMin, 1)); boundVerts[1] = mul(mvpMatrix, float4(boundMax, 1)); boundVerts[2] = mul(mvpMatrix, float4(boundMax.x, boundMax.y, boundMin.z, 1)); boundVerts[3] = mul(mvpMatrix, float4(boundMax.x, boundMin.y, boundMax.z, 1)); boundVerts[4] = mul(mvpMatrix, float4(boundMin.x, boundMax.y, boundMax.z, 1)); boundVerts[5] = mul(mvpMatrix, float4(boundMin.x, boundMax.y, boundMin.z, 1)); boundVerts[6] = mul(mvpMatrix, float4(boundMax.x, boundMin.y, boundMin.z, 1)); boundVerts[7] = mul(mvpMatrix, float4(boundMin.x, boundMin.y, boundMax.z, 1)); bool isInside = false; for (int i = 0; i < 8; i++) { float4 boundVert = boundVerts[i]; bool inside = boundVert.x <= boundVert.w && boundVert.x >= -boundVert.w && boundVert.y <= boundVert.w && boundVert.y >= -boundVert.w && boundVert.z <= boundVert.w && boundVert.z >= -boundVert.w; isInside = isInside || inside; } if (isInside) { MatrixInfo matrixInfo; matrixInfo.localToWorldMatrix = info.localToWorldMatrix; matrixInfo.worldToLocalMatrix = info.worldToLocalMatrix; cullresult.Append(matrixInfo); } }
我们会看到从脚本里面传入compute shader的包围盒信息的八个顶点都进行了转换到投影空间裁剪的操作。裁剪完成将结果buffer传入shader中,shader代码如下(为了方便,直接用了内置管线的表面着色器):
Shader "Unlit/IndirectShader" { Properties { _MainTex ("Albedo (RGB)", 2D) = "white" {} _Glossiness ("Smoothness", Range(0,1)) = 0.5 _Metallic ("Metallic", Range(0,1)) = 0.0 } SubShader { Tags { "RenderType"="Opaque" } LOD 200 CGPROGRAM // Physically based Standard lighting model #pragma surface surf Standard addshadow fullforwardshadows #pragma multi_compile_instancing #pragma instancing_options procedural:setup sampler2D _MainTex; struct Input { float2 uv_MainTex; }; #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED struct MatrixInfo { float4x4 localToWorldMatrix; float4x4 worldToLocalMatrix; }; StructuredBuffer<MatrixInfo> positionBuffer; #endif void rotate2D(inout float2 v, float r) { float s, c; sincos(r, s, c); v = float2(v.x * c - v.y * s, v.x * s + v.y * c); } void setup() { #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED MatrixInfo data = positionBuffer[unity_InstanceID]; unity_ObjectToWorld = data.localToWorldMatrix; unity_WorldToObject = data.worldToLocalMatrix; #endif } half _Glossiness; half _Metallic; void surf (Input IN, inout SurfaceOutputStandard o) { fixed4 c = tex2D (_MainTex, IN.uv_MainTex); o.Albedo = c.rgb; o.Metallic = _Metallic; o.Smoothness = _Glossiness; o.Alpha = c.a; } ENDCG } }
效果如下:
准确的视锥culling。。。
这样,Gpu culling就完成了。核心就是理解DrawIndirect这个接口和GpuInstance,这个比较基础,这里就不说了(不会用接口看官方文档的介绍,GPU Instance的原理可以自行百度,或者找个时间再写一篇扫个盲),代码没什么难度,但是跑一下发现一个问题:
可以看到set compute buffer的执行效率如此之低。因为set compute buffer实际上是cpu 向 gpu传输数据,带宽问题就会导致这个效率问题。因此我们可以把set compute buffer这一步骤移到当数量改变时再去set,但是这种程度的卡顿在游戏中实际使用时无法接受的。所以目前draw indirect和gpu culling更适合于位置旋转缩放不变的一些物体,并且有高度的重复mesh。我们可以将所有的模型预烘焙位置信息,然后数据一次放在gpu就不动了。最常见的例子就是大批量草地的渲染,通过这种方式会得到非常好的优化。
这就完了?就这?
是的,完了,本来想把基于GPU的Hi-z写一下,但是懒,嗯!在这里简单说下原理吧:
我们刚才GPU culling做的是视锥剔除,还有遮挡剔除还没有做,而通过GPU 的 Hi-z culling是常见的遮挡剔除方案。简单来说就是通过不同采样不同mip level的深度图,根据深度图和物体进行深度对比,决定哪个物体被cull,就不会被append到result中。深度图的miplevel可以直接采样低level的mipmap,但是会比较激进,因为要保证正确的遮挡剔除,必须取多个像素中深度最大的一个像素。而默认的mipmap不是这样的。
具体hiz的实现已经有很多了,这里给一个链接:https://zhuanlan.zhihu.com/p/47615677 文章来自知乎大V:MAXWELL
揉了揉困酣的双眼,看了看时间,已经是凌晨1点20了,写的内容如果有误可能是因为太困了,欢迎指正。