Optimizing Graphics in Unity: Study Notes
https://learn.unity.com/tutorial/optimizing-graphics-in-unity
1. Rendering
2. Camera
Clear
On mobile tile-based renderers, the clear command is particularly important. Unity takes care of the details, so you only have to set the clear flags on the Camera and avoid using the "Don't Clear" flag when targeting mobile devices. The underlying behavior of the clear command depends on the platform and graphics driver, but depending on the clear flag you choose, it can impact performance significantly, because Unity has to either clear the previous content, set flags to ignore the previous content, or read previous content back from the buffer.
Clear flags
On mobile, avoid Unity's default Skybox (appropriately named Default-Skybox), which is computationally expensive and which is enabled by default in all new Scenes.
Discard and Restore buffer
When using OpenGL ES on Adreno GPUs (Qualcomm), Unity only discards the framebuffer to avoid a framebuffer restore. On PowerVR and Mali (Arm) GPUs, Unity clears to prevent a framebuffer restore.
Moving things in or out of graphics memory is resource-intensive on mobile devices, because the devices use a shared memory architecture, meaning CPU and GPU share the same physical memory. On tile-based GPUs like Adreno, PowerVR or the Apple A-series, loading or storing data in the logical buffer uses significant system time and battery power. Transferring content from shared memory to the portion of the framebuffer for each tile (or from the framebuffer to shared memory) is the main source of resource-heavy activity.
Tile-based Rendering
Tile-based rendering divides the viewport into smaller tiles with a typical size of 32x32 px, and keeps these tiles in faster memory closer to the GPU. The copy operations between this smaller memory and the real framebuffer can take some time, because memory operations are a lot slower than arithmetic operations.
These slow memory operations are the main reason to avoid loading the previous framebuffer contents on tile-based GPUs each new frame, which you do by issuing a glClear (OpenGL ES) call. By issuing a glClear command, you tell the hardware that you do not need the previous buffer content, so it does not need to copy the color buffer, depth buffer, and stencil buffer from the framebuffer to the smaller tile memory.
RenderTexture Switching
The graphics driver executes load and store operations on the framebuffer when you switch rendering targets. For example, if you render to a view's color buffer and to a Texture in two consecutive frames, the system repeatedly transfers (loads and stores) the Texture's content between shared memory and the GPU.
FrameBuffer Compression
The clear command also has an effect on the compression of the framebuffer, including the color, depth, and stencil buffers. Clearing the entire buffer allows it to compress more tightly, reducing the amount of data the driver has to transfer between the GPU and memory, and therefore allowing for higher frame rates due to improved throughput. On tile-based architectures, clearing tiles is a small task that involves setting a few bits in each tile. When complete, this makes the tile very cheap to fetch from memory. Note: These optimizations apply to tile-based deferred rendering GPUs and streaming GPUs.
Culling
Culling happens per-camera and can have a severe impact on performance, especially when multiple cameras are enabled concurrently. The two types of culling are frustum and occlusion culling:
- Frustum Culling is performed automatically on every Unity Camera
- Occlusion culling is controlled by the developer
Frustum Culling
Frustum Culling skips rendering GameObjects that lie outside of the Camera frustum, which saves rendering time.
Note: Frustum culling is jobified in 2017.1 and later, and Unity now also culls by layer first. Culling by layer means that Unity only culls the GameObjects on the layers the Camera uses, and ignores GameObjects on other layers. Afterwards, Unity uses jobs on threads to cull GameObjects based on the camera frustum.
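The two-stage order described in the note (layer test first, frustum test second) can be sketched in plain Python. This is a conceptual model only, not Unity's implementation: the `Renderable` class, the bounding-sphere test, and the box-shaped "frustum" are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Renderable:           # hypothetical stand-in for a GameObject's renderer
    name: str
    layer: int              # layer index, 0..31
    center: tuple           # bounding-sphere center (x, y, z)
    radius: float

def on_camera_layers(obj, culling_mask):
    """Layer test: the camera mask holds one bit per layer."""
    return ((culling_mask >> obj.layer) & 1) == 1

def inside_frustum(obj, planes):
    """Sphere-vs-plane test. Each plane is (nx, ny, nz, d) with the normal
    pointing into the volume; a point p is inside when dot(n, p) + d >= 0."""
    x, y, z = obj.center
    for nx, ny, nz, d in planes:
        if nx * x + ny * y + nz * z + d < -obj.radius:
            return False    # sphere lies completely behind this plane
    return True

def cull(objects, culling_mask, planes):
    # Stage 1: drop everything on layers this camera does not render.
    candidates = [o for o in objects if on_camera_layers(o, culling_mask)]
    # Stage 2: run the more expensive geometric test only on the survivors.
    return [o.name for o in candidates if inside_frustum(o, planes)]

# Simplified box-shaped volume: x and y in -10..10, z in 1..100.
planes = [
    (1, 0, 0, 10), (-1, 0, 0, 10),
    (0, 1, 0, 10), (0, -1, 0, 10),
    (0, 0, 1, -1), (0, 0, -1, 100),
]
objects = [
    Renderable("crate", 0, (0, 0, 20), 1.0),
    Renderable("ui_marker", 5, (0, 0, 20), 1.0),   # layer not in the mask
    Renderable("far_rock", 0, (0, 0, 200), 1.0),   # beyond the far plane
]
visible = cull(objects, culling_mask=0b1, planes=planes)
print(visible)  # ['crate']
```

The layer test is a single bit operation per object, which is why running it before the geometric test reduces the work left for the frustum stage.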
Occlusion Culling
When you enable Occlusion Culling, Unity does not render GameObjects if the Camera cannot see them. For example, rendering another room is unnecessary if a door is closed and the Camera cannot see the room
If you enable Occlusion Culling it can significantly increase performance, but it occupies more disk space and RAM because the Unity Umbra integration bakes the occlusion data during the build and Unity needs to load it from disk to RAM while loading a Scene.
Multiple Cameras
When you use many active cameras in your scene, there is a significant fixed culling and render overhead per camera. Unity reduced the culling overhead in Unity 2017.1 through layer culling, but if Cameras do not use different layers to structure the content to render, this does not have any effect.
Per-Layer culling distances
You can set per-layer culling distances manually on the Camera via Script. Setting the cull distance is useful for culling small GameObjects that do not contribute to the Scene when the Camera views them from a given distance
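As a rough illustration of the idea, the sketch below models per-layer cull distances in Python. It assumes a 32-entry distance array where 0 means "use the camera's far clip distance" (mirroring how Unity's `Camera.layerCullDistances` is documented) and uses a simple spherical distance check; the layer index and helper names are hypothetical.

```python
import math

NUM_LAYERS = 32

def effective_cull_distance(layer, layer_cull_distances, far_clip):
    """0 means 'no override': fall back to the far clip distance."""
    d = layer_cull_distances[layer]
    return far_clip if d == 0 else min(d, far_clip)

def within_cull_distance(camera_pos, obj_pos, layer, layer_cull_distances, far_clip):
    limit = effective_cull_distance(layer, layer_cull_distances, far_clip)
    return math.dist(camera_pos, obj_pos) <= limit

SMALL_PROPS_LAYER = 8                 # hypothetical layer for pebbles, debris, etc.
distances = [0.0] * NUM_LAYERS
distances[SMALL_PROPS_LAYER] = 30.0   # cull small props beyond 30 units

camera = (0.0, 0.0, 0.0)
pebble_visible = within_cull_distance(camera, (0.0, 0.0, 50.0), SMALL_PROPS_LAYER, distances, 1000.0)
building_visible = within_cull_distance(camera, (0.0, 0.0, 50.0), 0, distances, 1000.0)
print(pebble_visible, building_visible)  # False True
```

Both objects sit at the same distance, but only the one on the small-props layer is culled, which is the effect the per-layer setting is meant to achieve.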
Skinned Motion Vectors
Fillrate
Decreased pixel fillrate is a result of overdraw and fragment shader complexity. Unity often implements shaders as multiple passes (draw diffuse, draw specular, and so forth). Using multiple passes leads to overdraw, where the different Shaders touch (read/write) the same pixels multiple times.
Overdraw
Overdraw view
The Overdraw view allows you to see the objects that Unity draws on top of one another.
White is the least optimal, because a pixel is overdrawn multiple times, while black means no overdraw is occurring.
Transparency
Transparency also adds to overdraw. In the optimal case, every pixel on the screen is touched only once per frame.
Alpha Blending
You should avoid overlapping alpha-blended geometry (such as dense particle effects and full-screen post-processing effects) to keep fillrate low.
Draw Order
Objects in the Unity opaque queue are rendered in front-to-back order using a bounding box (AABB center coordinates) and depth testing to minimize overdraw. However, Unity renders objects in the transparent queue in back-to-front order, and does not perform depth testing, making objects in the transparent queue subject to overdraw. Unity also sorts transparent GameObjects based on the center position of their bounding boxes.
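The ordering rule above can be sketched as a sort keyed on the distance from the camera to each bounding-box center. This is a simplified model, not Unity's sorting code; the tuple layout and object names are assumptions for illustration.

```python
import math

def distance_to_camera(camera_pos, aabb_min, aabb_max):
    """Distance from the camera to the center of an axis-aligned bounding box."""
    center = tuple((lo + hi) / 2 for lo, hi in zip(aabb_min, aabb_max))
    return math.dist(camera_pos, center)

def sort_draws(camera_pos, draws):
    """draws: list of (name, queue, aabb_min, aabb_max),
    where queue is 'opaque' or 'transparent'."""
    def key(draw):
        return distance_to_camera(camera_pos, draw[2], draw[3])
    # Opaque: nearest first, so depth testing can reject hidden pixels early.
    opaque = sorted((d for d in draws if d[1] == "opaque"), key=key)
    # Transparent: farthest first, so blending composites correctly.
    transparent = sorted((d for d in draws if d[1] == "transparent"), key=key, reverse=True)
    return [d[0] for d in opaque + transparent]

draws = [
    ("wall",  "opaque",      (-5, -5, 9), (5, 5, 11)),   # center z = 10
    ("smoke", "transparent", (-1, -1, 7), (1, 1, 9)),    # center z = 8
    ("crate", "opaque",      (-1, -1, 4), (1, 1, 6)),    # center z = 5
    ("glass", "transparent", (-1, -1, 2), (1, 1, 4)),    # center z = 3
]
order = sort_draws((0, 0, 0), draws)
print(order)  # ['crate', 'wall', 'smoke', 'glass']
```

Because only the box center is used, a very large object can sort ahead of smaller objects that actually cover it, which is the situation the Z-testing notes below describe.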
Z-testing
Z-testing is faster than drawing a pixel. Unity performs culling and opaque sorting via bounding box. Therefore, Unity may draw large background objects first, such as the Skybox or a ground plane, because the bounding box is large and fills a large number of pixels that end up not being visible later after being overdrawn with other objects. If you see this happen, move those objects to the end of the queue manually.
Draw Call Batching
PC hardware can push a lot of draw calls, but the overhead of each call is still high enough to warrant trying to reduce them. On mobile devices, however, draw call optimization is vital.
Instancing
Instancing forces Unity to use constant buffers, which work well on desktop GPUs but are slow on mobile devices. Instancing only starts to become useful at around 50-100 Meshes, depending on the underlying hardware.
Geometry
It is essential to keep the geometric complexity of GameObjects in your Scenes to a minimum, otherwise Unity has to push a lot of vertex data to the graphics card. 200k static triangles is a conservative target for low-end mobile. However, this also depends on whether your GameObjects are animated or static
Level of Detail (LOD)
Static Scenes
High-Quality LODs
Runtime Mesh Combination
Animation LODs
3. Textures
Asset Auditing
Texture Compression
Texture compression offers significant performance benefits when you apply it correctly. On newer mobile devices, you should favor ASTC compressed Texture formats. If ASTC is not available on your target device, use ETC2 on Android and PVRTC on iOS.
ASTC
Using ASTC Texture compression for Game Assets
PVRTC
PVRTC was the main Texture compression format on iOS until Apple added ASTC. If you use PVRTC on Android, you should replace it with ETC2 if possible.
Note: The PVRTC Texture format on iOS and the ETC format (Android 4.x devices) require square Textures. When compressing a non-square Texture, two behaviors can occur.
- If no Sprite uses the Texture and the compressed memory footprint is smaller than it would be if left uncompressed, Unity resizes the Texture based on the non-power-of-two (NPOT) Texture scale factor.
- Otherwise, Unity does not resize the Texture, and marks it as uncompressed.
GPU Upload
Unity uploads a Texture directly to the GPU after it finishes loading, and does not wait until the Texture becomes visible in the Camera frustum.
When a loading thread finishes loading Scenes and Assets, Unity needs to awaken them. Where and how loading happens depends on the Unity version and the calls used to initialize the load.
Load Behavior
If you load an Asset from AssetBundles, Resources, or Scenes, Unity goes from the preloading thread (disk I/O) to the graphics thread (GPU upload). If you use Unity 5.5 or later, and you enable Graphics Jobs, Unity goes from the preloading jobs directly to the GPU.
Awake Behavior
Unity awakes Assets on the main thread directly after awakening all Scene GameObjects. If you use AssetBundle.LoadAsset, Resources.Load or SceneManager.LoadScene to load Assets and Scenes, Unity blocks the main thread and wakes up all Assets. If you're using the non-blocking versions of those calls (for example, AssetBundle.LoadAssetAsync), Unity uses time-slicing to wake the Assets up.
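The difference between blocking and time-sliced awake can be pictured with a small scheduling sketch. The per-frame budget, the asset costs, and the rule of always waking at least one asset per frame are assumptions made for illustration; the notes do not describe Unity's actual scheduling in this detail.

```python
from collections import deque

def awake_time_sliced(assets, budget_ms):
    """assets: list of (name, cost_ms) in load order.
    Returns one list of asset names per frame."""
    pending = deque(assets)
    frames = []
    while pending:
        spent = 0.0
        frame = []
        # Always wake at least one asset per frame so progress is guaranteed,
        # then keep going while the next asset still fits in the budget.
        while pending and (not frame or spent + pending[0][1] <= budget_ms):
            name, cost = pending.popleft()
            frame.append(name)
            spent += cost
        frames.append(frame)
    return frames

assets = [("a", 1.0), ("b", 1.0), ("c", 3.0), ("d", 0.5), ("e", 1.0)]

sliced = awake_time_sliced(assets, budget_ms=2.0)
blocking = awake_time_sliced(assets, budget_ms=float("inf"))
print(sliced)    # [['a', 'b'], ['c'], ['d', 'e']]
print(blocking)  # [['a', 'b', 'c', 'd', 'e']]
```

With an unlimited budget everything wakes in a single long frame, which models the blocking calls; with a finite budget the same work is spread over several shorter frames.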
Memory Behavior
While loading several Textures at once, if either the upload rate is not fast enough or the main thread stalls, you can adjust the Texture buffer size. Changing the default values, though, can lead to high memory pressure. You can read more about memory restrictions in Texture buffers when using time-slice awake in the RingBuffer section of the Memory Management in Unity guide.
Note: If GPU memory overloads, the GPU unloads the least-recently-used Texture and forces the CPU to re-upload it the next time it enters the camera frustum.
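Least-recently-used eviction, as mentioned in the note, can be sketched with an ordered dictionary. This is a generic LRU model under assumed sizes and names, not a description of how any particular GPU driver manages memory.

```python
from collections import OrderedDict

class TextureMemory:
    """Toy model of GPU texture memory with least-recently-used eviction."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.resident = OrderedDict()   # name -> size in MB, least recently used first
        self.uploads = 0                # how many CPU-to-GPU uploads were needed

    def use(self, name, size_mb):
        if name in self.resident:
            self.resident.move_to_end(name)   # mark as most recently used
            return
        # Evict from the least-recently-used end until the new texture fits.
        while self.resident and sum(self.resident.values()) + size_mb > self.capacity_mb:
            self.resident.popitem(last=False)
        self.resident[name] = size_mb
        self.uploads += 1

memory = TextureMemory(capacity_mb=10)
memory.use("rock", 4)
memory.use("grass", 4)
memory.use("rock", 4)    # already resident: no upload, just refreshed
memory.use("sky", 4)     # does not fit: evicts "grass", the least recently used
memory.use("grass", 4)   # must be uploaded again, evicting "rock"
print(list(memory.resident), memory.uploads)  # ['sky', 'grass'] 4
```

The second `grass` upload is the cost the note warns about: a texture that was evicted has to be sent to the GPU again when it is next needed.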
4. Multithreaded Rendering & Graphics Jobs
Singlethreaded Rendering (single client, no worker thread)
Unity uses singlethreaded rendering by default if none of the other modes are enabled.
This causes the single client to occupy the main thread while executing the high-level rendering commands
The single client executes all the rendering commands (RCMD) on the main thread. The client also owns the real graphics device GfxDevice and performs the actual rendering through the underlying graphics API (GCMD) on the main thread. This is suboptimal, because all commands you execute on the main thread subtract from important frametime which you could use for other subsystems running on the main thread
Multithreaded Rendering (single client, single worker thread)
Multithreaded rendering in Unity is implemented as a single client, single worker thread. This works by taking advantage of the abstract GfxDevice interface in Unity. The different graphics API implementations (such as Vulkan, Metal and GLES) inherit from the GfxDevice.
Renderthread
When you enable multithreaded rendering, you can spot the GfxDeviceClient class functions in call stacks on a native platform profiler such as Xcode. In the Unity Timeline Profiler, it is called the Renderthread.
The single client forwards all the rendering commands (RCMD) to the renderthread, a special worker thread only for rendering, which owns the real graphics device GfxDevice and performs the actual rendering through the underlying graphics API (GCMD).
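The single-client, single-worker arrangement is a producer/consumer pattern, which the following Python sketch models with a queue and one worker thread. The class name, the command strings, and the `executed` list (standing in for calls into the graphics API) are hypothetical; Unity's GfxDeviceClient and renderthread are native code and work differently in detail.

```python
import queue
import threading

class RenderThread:
    """Toy worker that owns the 'device' and executes forwarded commands in order."""

    def __init__(self):
        self.commands = queue.Queue()
        self.executed = []   # stands in for calls into the graphics API (GCMD)
        self.thread = threading.Thread(target=self._run, name="RenderThread")
        self.thread.start()

    def _run(self):
        while True:
            cmd = self.commands.get()
            if cmd is None:          # sentinel: no more commands
                break
            self.executed.append((threading.current_thread().name, cmd))

    def submit(self, cmd):
        """Called by the client on the main thread; returns immediately."""
        self.commands.put(cmd)

    def shutdown(self):
        self.commands.put(None)
        self.thread.join()

frame_commands = ["SetRenderTarget", "Clear", "DrawMesh(crate)", "DrawMesh(wall)", "Present"]

render_thread = RenderThread()
for rcmd in frame_commands:      # the main thread only enqueues (RCMD)
    render_thread.submit(rcmd)
render_thread.shutdown()

print([cmd for _, cmd in render_thread.executed] == frame_commands)  # True
```

The main thread's cost is reduced to enqueueing, while the ordering of commands is preserved because a single worker drains a FIFO queue.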
Availability
Performance Considerations
You should enable Multithreaded Rendering whenever possible, as it usually benefits performance greatly.
Profiling Multithreaded Rendering
Often, you need to profile Multithreaded Rendering to improve rendering performance, and it’s necessary to disable the Multithreaded Rendering setting to get correct results. You can also use the script-only player setting PlayerSettings.MTRendering to change Multithreaded Rendering. Alternatively, disable this in the Player Settings of the relevant platforms (see the earlier section on Availability). To disable Multithreaded Rendering in the Editor, use the following command line option: -force-gfx-direct. If you need the client device enabled (for example, to use display lists) use -force-gfx-st instead.
Jobified Rendering (multiple clients, single worker thread)
This render mode was available in Unity 5.4, 5.5 and 5.6, but has since been replaced by Graphics Jobs.
Multiple jobs, each of them running on its own thread, generate intermediate graphics commands (IGCMD). Afterwards, similar to Multithreaded Rendering (single client, single worker thread), a worker thread processes the buffered intermediate graphics commands and submits graphics commands (GCMD) to the real graphics device GfxDevice.
These jobs have clearly defined inputs (RCMD) because they can run at the same time as user script code, which potentially changes the state of any object in the world. Jobs output commands (RCMD) to a different GfxDeviceClient per thread, and they write into their own block-allocating buffers, which the worker thread then executes.
Graphics Jobs (multiple clients, no worker thread)
Unity disables Graphics Jobs by default, but you can enable them in the Player Settings. Multiple native command generation threads take advantage of the graphics APIs that support recording graphics commands in a native format on multiple threads. This removes the performance impact of writing and reading commands in a custom format before submitting them to the API. Similar to the other modes, Graphics Jobs generate commands by calling GfxDevice functions. However, since the devices are now platform-specific, Graphics Jobs translate the commands directly into, for example DirectX 12 or Vulkan command buffers.
Availability
Profiling Rendering
When you investigate the rendering system while profiling, disable Multithreaded Rendering, Jobified Rendering, and Graphics Jobs to see the whole render queue executed on the main thread in singlethreaded rendering mode. This makes it easier to measure timings and to inspect the command queue.
GfxThreadableDevice Functions
When you look at GfxDeviceClient functions in a native call stack while profiling, you often see extra virtual functions from the GfxThreadableDevices class.
These extra functions are variations of the GfxDevice functions that take data that isn’t thread-safe (for example, ShaderLab::PropertySheet) and convert them to data that is thread-safe. When you call SetShaders() in Multithreaded Rendering, the main thread takes a ShaderLab::PropertySheet and turns it into plain serialized data that GfxDevice feeds to SetShadersThreadable() on the renderthread. When you investigate shader performance, measure the timing of the SetShadersThreadable() method to gain information on how long it takes to set actual shaders and compare them to their non-threaded equivalent.
5. Framebuffer
The framebuffer contains the depth, stencil, and color buffers. Color buffers are an essential part and are always present, while other buffers can be present or not depending on the graphics features you use.
Double & Triple Buffering
Color Buffer
The number of framebuffers in use depends mostly on the graphics driver, and there is one color buffer per framebuffer. For example, when you use OpenGL ES on Android, Unity uses one EGLWindowSurface with a color buffer, but Unity doesn't have control over how many color buffers and framebuffers it uses. Typically, Unity uses three framebuffers for triple buffering, but if a device does not support it, it falls back to double buffering and uses two framebuffers, including two color buffers.
Stencil & Depth Buffer
The stencil buffer and depth buffer are only bound to the framebuffer if graphics features use them. You should disable them if you know that your application does not require them, because a framebuffer occupies a great deal of graphics memory depending on resolution, and is resource-intensive to create.
Disable Depth and Stencil
On mobile GPUs, the depth buffer and stencil buffer are two separate buffers: 24 bits for the depth buffer and 8 bits for the stencil buffer. Unlike on desktop platforms, they are not combined into a single 32-bit buffer that uses 24 bits for depth and 8 bits for stencil.
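To get a feel for how much memory these buffers occupy, the sketch below does the arithmetic for a 1920x1080 target. The 32-bit color format, the three color buffers (triple buffering), and the single shared depth and stencil buffer are assumptions for illustration; real drivers may pad, compress, or allocate differently.

```python
def buffer_bytes(width, height, bits_per_pixel):
    """Size of one uncompressed buffer in bytes."""
    return width * height * bits_per_pixel // 8

def framebuffer_bytes(width, height, color_buffers=3, color_bits=32,
                      depth_bits=24, stencil_bits=8):
    """Color buffers plus one depth and one stencil buffer (0 bits = disabled)."""
    color = buffer_bytes(width, height, color_bits) * color_buffers
    depth = buffer_bytes(width, height, depth_bits)
    stencil = buffer_bytes(width, height, stencil_bits)
    return color + depth + stencil

MIB = 1024 * 1024
with_depth_stencil = framebuffer_bytes(1920, 1080)
color_only = framebuffer_bytes(1920, 1080, depth_bits=0, stencil_bits=0)

print(with_depth_stencil / MIB)                  # 31.640625
print(color_only / MIB)                          # 23.73046875
print((with_depth_stencil - color_only) / MIB)   # 7.91015625
```

Under these assumptions, disabling depth and stencil saves roughly 8 MiB at 1080p, and the saving grows with resolution because every term scales with width times height.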
Native Resolution
Modern mobile phones have very high-resolution displays. The native resolution is often well above 1080p. Even for modern consoles, 1080p is difficult to support without a decrease in performance.
Screen.SetResolution
Buffer Size
Final Blit
On Android and OpenGLES, Unity creates a framebuffer object with color buffer and depth buffer attachment, which Unity uses for all the rendering. At the end of the frame, Unity blits this framebuffer into the EGLSurface.
When using Vulkan on Android, Unity does not perform the final blit, because doing so interacts with the existing BufferQueue component via the existing ANativeWindow interface, and uses Gralloc HAL for the data
6. Shaders
Mobile Shaders
On mobile devices, it is essential to verify that all fragment shaders are mobile friendly. When you use built-in shaders, you should use the Mobile or Unlit version of the shader. Avoid excessive use of multi-pass shaders (for example, legacy specular) and excessive shader passes (that is, more than 2 passes)
Lightmaps
Where appropriate, you should use the most basic shaders. Make use of the inexpensive Mobile->Unlit (Supports Lightmap) shader to lightmap your Scene.
Project Imports
You can remove every shader that you don't use from the Always included list of shaders in the Graphics Settings (Edit->ProjectSettings->Graphics). You can also add shaders to the list, which keeps them included for the lifetime of the application. Tip: If you want finer control over load times, use shader variant collections instead; this lets you take the performance impact of loading at a time you choose during run time, rather than increasing your initial load time.
Default Shaders
Some Unity Shaders are always included in the build by default, such as the Splash Screen, pink error Shader, and the clear screen. These Shaders account for about a dozen kilobytes in total, not megabytes.
Shader Build Report
After the build you can find data for large shaders in the Editor.log, which includes shader timing and size and looks similar to the following log:
This report tells you a couple of things about the Test shader:
- The shader expands into 482 variants due to #pragma multi_compile and shader_feature
- Unity compresses the shader included in the game data to roughly the sum of the compressed sizes: 0.14 + 0.12 + 0.20 + 0.15 = 0.61MB
- At run time, Unity keeps the compressed data in memory (0.61MB) while the data for your currently used graphics API (for example Metal) is uncompressed which in the above example would account for 2.56MB
Shader Memory
Inspecting the log file shows the compressed disk size for single Shaders. To determine the size of Shaders at run time, you can perform a detailed memory capture with the Unity Profiler. If you complete a deep memory profile, you can inspect Shaderlab, which includes everything associated with Shaders under the Shaderlab root, including buffers, source code, and other allocations related to the compilation of shaders.
Shader Keywords
Shader keywords are global. Currently, you can only use 196 keywords, because Unity itself uses 60 of the 256 available keywords internally.
When you build Shaders, you can use an underscore (_) in place of a keyword name to toggle functionality on and off without occupying a global keyword (for instance, #pragma multi_compile __ SUPER_FEATURE).
Use shader_feature over multi_compile, as it saves memory by stripping unneeded keywords. Shaders themselves have their own object root, and the Profiler lists them under Shaders.
Shader Variants
Shaders often include a multitude of variants which increase build size and which might not be necessary
Making multiple shader program variants
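Variant counts grow multiplicatively: each #pragma multi_compile line contributes one choice, and every combination of choices is a separate variant. The sketch below enumerates the combinations for three made-up directive lines; the keyword names are hypothetical, and `_` stands for the "no keyword" option.

```python
from itertools import product

def enumerate_variants(multi_compile_lines):
    """Each inner list holds the options of one #pragma multi_compile line.
    Returns every combination as a tuple of the keywords that are enabled."""
    variants = []
    for combo in product(*multi_compile_lines):
        variants.append(tuple(k for k in combo if k != "_"))
    return variants

lines = [
    ["_", "FOG_ON"],                                     # hypothetical keywords
    ["QUALITY_LOW", "QUALITY_MEDIUM", "QUALITY_HIGH"],
    ["_", "SHADOWS_ON"],
]
variants = enumerate_variants(lines)
print(len(variants))  # 12, that is 2 * 3 * 2

# Adding one more on/off line doubles the count.
more = enumerate_variants(lines + [["_", "DETAIL_MAP"]])
print(len(more))      # 24
```

This is why a handful of independent toggles can expand a single shader into hundreds of variants, and why stripping unused combinations matters for build size.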
Shader Variant Collections
Unity can preload shader variant collections during application load time, or you can load them via scripts. If you load them via script, you gain control over the loading process.
Optimizing Shader Load Time
If you add a Shader and a variant collection which refers to it, Unity loads all sub-shaders (LODs) of the Shader when you warm up the variant collection.
Shader Preloading
Unity can preload Shaders and keep them in memory for the lifetime of the application, which grants control over how much memory Shaders occupy. Additionally, preloading Shaders reduces Scene load time issues, as you control when Unity loads the Shaders.
Built-in shaders
Built-in Shaders on mobile are generalized for a specific use-case; for example, Unity made the UI/Default shader specifically for UI elements. You should remove any Shaders from the Always Included Shader list that you do not use.