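Analyzing Android Performance with Streamline
Arm Mobile Studio is a suite of performance optimization tools for analyzing the CPU and GPU behavior of apps on Android (no root required). It helps developers locate performance bottlenecks in their apps.
It consists of four tools: Performance Advisor, Streamline, Graphics Analyzer, and Mali Offline Compiler.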
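Tool | Description |
---|---|
Performance Advisor | Command-line tool. Reads a Streamline capture file, produces an easy-to-read performance analysis report, and offers optimization advice. |
Streamline | Captures CPU, GPU, memory, and other performance data and displays it graphically in real time. Note: GPU information is available only on phones with a Mali GPU. |
Graphics Analyzer | Debugs OpenGL ES or Vulkan graphics API usage and analyzes overdraw, shaders, textures, and so on. Note: requires a Mali GPU. |
Mali Offline Compiler | Checks the performance of shader code on Mali GPUs. |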
It comes in a Starter Edition (free) and a Professional Edition (paid); see the edition comparison:
Feature | Starter Edition | Professional Edition |
---|---|---|
Run Arm Mobile Studio tools headlessly within your existing continuous integration systems | No | Yes |
Generate machine-readable reports in JSON format | No | Yes |
Access world-class support from Arm | No | Yes |
Intuitive Performance Advisor reports pinpointing problem areas and providing profiling advice | Yes | Yes |
Mali Offline Compiler shows performance and bottlenecks relating to shaders or kernels | Yes | Yes |
Detailed application profiling with off-the-shelf mobile devices | Yes | Yes |
Full support for all announced Arm 32-bit and 64-bit CPU architectures | Yes | Yes |
Access to detailed CPU and GPU hardware counters | Yes | Yes |
Frame-by-frame analysis of OpenGL ES and Vulkan content | Yes | Yes |
Enhance your profiling experience with custom code annotations | Yes | Yes |
Debug and profile VR Applications | Yes | Yes |
License required for use | Free to use | Purchase required for additional features and use in a continuous integration system |
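Note: download the Starter Edition (free) of Arm Mobile Studio from the Arm Developer website. The latest version at the time of writing was 2020.2; see: release history
After installation, the Starter Edition (free) of Arm Mobile Studio contains the following files: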
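A brief introduction to Mali Offline Compiler
Vertex shader
Running the command malioc shader.vert prints the following compilation statistics:
Note: on Mali GPUs the vertex shader is compiled into two variants that both run: the Position variant is a position-only VS, and the Varying variant computes the remaining (non-position) outputs.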
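When many shaders need checking, the command above can be scripted. The following is a minimal sketch, assuming `malioc` is on the PATH; it uses only the plain `malioc <shader>` form shown in this article, and the helper names `build_malioc_command` and `run_malioc` are made up for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def build_malioc_command(shader_path: str, executable: str = "malioc") -> List[str]:
    """Return the argument list for the plain `malioc <shader>` invocation."""
    return [executable, shader_path]


def run_malioc(shader_path: str, executable: str = "malioc") -> Optional[str]:
    """Run the compiler and return its text report.

    Returns None when the executable cannot be found on PATH, so callers
    can skip the check on machines without Arm Mobile Studio installed.
    """
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        build_malioc_command(shader_path, executable),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as text
        check=False,          # a failed compile still yields a report to inspect
    )
    return result.stdout
```

Calling `run_malioc("shader.vert")` would return the statistics text, and the same wrapper should work for the `shader.frag` command in the next section.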
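Fragment (pixel) shader
Running the command malioc shader.frag prints the following compilation statistics:
Note 1: if Stack spilling shows XXX bytes, XXX bytes were spilled. The root cause is a shortage of registers (for example, too many dynamic if branches), and it causes serious performance problems at run time: some variables are kept in GPU memory instead of registers, so reading and writing them is very slow.
In the Xcode Metal Frame Debugger, running out of registers is reported as a Register Spilling warning.
Note 2: when Uses late ZS test is false and Uses late ZS update is false, HSR (or Early-Z) is effective.
When Uses late ZS test is false and Uses late ZS update is true, HSR (or Early-Z) is disabled.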
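The two notes above can be turned into an automated check. The sketch below is illustrative only: `SAMPLE_REPORT` is a hypothetical excerpt built from the field names in the notes, not real tool output, and treating `Uses late ZS test: true` as a problem as well is an assumption that goes beyond the two cases listed.

```python
import re
from typing import Dict, List

# Hypothetical report excerpt; only the field names come from the notes above.
SAMPLE_REPORT = """\
Stack spilling: 16 bytes
Uses late ZS test: false
Uses late ZS update: true
"""


def parse_properties(report: str) -> Dict[str, str]:
    """Collect `name: value` lines of a text report into a dictionary."""
    props: Dict[str, str] = {}
    for line in report.splitlines():
        match = re.match(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def find_warnings(props: Dict[str, str]) -> List[str]:
    """Flag stack spilling and late ZS usage as described in notes 1 and 2."""
    warnings: List[str] = []
    spilling = props.get("Stack spilling", "false")
    if spilling.lower() != "false":
        warnings.append(f"stack spilling: {spilling}")
    late_test = props.get("Uses late ZS test", "false").lower() == "true"
    late_update = props.get("Uses late ZS update", "false").lower() == "true"
    # Assumption: either flag being true means early ZS (HSR) is not fully effective.
    if late_test or late_update:
        warnings.append("early ZS (HSR) not fully effective")
    return warnings


print(find_warnings(parse_properties(SAMPLE_REPORT)))
```

A check like this could run over every shader in a project so that spilling or late ZS regressions are noticed early.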
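Reference: "Optimizing Unity Shaders with the Mali Compiler" (使用Mali Compiler对Unity Shader进行优化)
The rest of this article focuses on the Streamline performance analysis tool.
Phones
Huawei Mate 30 (8 cores, Mali-G76, 8 GB)
For more performance counters, see: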
Cortex-A55 - [1 of 6 counters available]
Branch Predictor: Mispredictions; Branch Predictor: Possible Predictions; Bus: Access; Bus: Access (due to read); Bus: Access (due to write); Cycles: Bus Cycles; Cycles: CPU Cycles; Data TLB: Translation table walk; Errors: Memory; Errors: Pre-decode; Exceptions: FIQ; Exceptions: IRQ; Exceptions: Taken; Instruction TLB: Translation table walk; Instructions (Executed): All; Instructions (Executed): Branch (Any); Instructions (Executed): Branch (Conditional); Instructions (Executed): Branch (Conditional, Mispredicted); Instructions (Executed): Branch (Immediate); Instructions (Executed): Branch (Indirect, Address predicted); Instructions (Executed): Branch (Indirect, Mispredicted Address); Instructions (Executed): Branch (Indirect, Mispredicted); Instructions (Executed): Branch (Mispredicted); Instructions (Executed): Branch (Return); Instructions (Executed): Branch (Return, Address predicted); Instructions (Executed): Branch (Return, Mispredicted Address); Instructions (Executed): Exception Returns; Instructions (Executed): Increment PMSWINC Register; Instructions (Executed): Load; Instructions (Executed): Store; Instructions (Executed): Unaligned Load/Store; Instructions (Executed): Write to CONTEXTIDR; Instructions (Executed): Write to PC; Instructions (Executed): Write to TTBR; Instructions (Speculated): All; Instructions (Speculated): Branch (immediate); Instructions (Speculated): Branch (indirect); Instructions (Speculated): Branch (return); Instructions (Speculated): Branch (software PC writes); Instructions (Speculated): Crypto; Instructions (Speculated): Data Processing (Advanced SIMD); Instructions (Speculated): Data Processing (Floating-point); Instructions (Speculated): Data Processing (Integer); Instructions (Speculated): Load; Instructions (Speculated): Load/Store; Instructions (Speculated): Store; L1 Data Cache: Access; L1 Data Cache: Access (due to read); L1 Data Cache: Access (due to write); L1 Data Cache: Enter Write Streaming Mode; L1 Data Cache: Refill; L1 Data Cache: Refill (due to prefetch); L1 Data Cache: Refill (due to read); L1 Data Cache: Refill (due to write); L1 Data Cache: Refill (from inside cluster); L1 Data Cache: Refill (from outside cluster); L1 Data Cache: Write Streaming Mode; L1 Data Cache: Write-back; L1 Data TLB: Access; L1 Data TLB: Refill; L1 Instruction Cache: Access; L1 Instruction Cache: Refill; L1 Instruction TLB: Access; L1 Instruction TLB: Refill; L2 Data Cache: Access; L2 Data Cache: Access (due to read); L2 Data Cache: Access (due to write); L2 Data Cache: Allocation without refill; L2 Data Cache: Refill; L2 Data Cache: Refill (due to prefetch); L2 Data Cache: Refill (due to read); L2 Data Cache: Refill (due to write); L2 Data Cache: Stash Dropped; L2 Data Cache: Write Streaming Mode; L2 Data Cache: Write-back; L2 Data/Unified TLB: Access; L2 Data/Unified TLB: Access (IPA); L2 Data/Unified TLB: Access (Last Level Walk); L2 Data/Unified TLB: Access (Level 2 Walk); L2 Data/Unified TLB: Refill; L2 Data/Unified TLB: Refill (IPA); L2 Data/Unified TLB: Refill (Last Level Walk); L2 Data/Unified TLB: Refill (Level 2 Walk); L3 Data Cache: Access; L3 Data Cache: Access (due to read); L3 Data Cache: Allocation without refill; L3 Data Cache: Refill; L3 Data Cache: Refill (due to prefetch); L3 Data Cache: Refill (due to read); L3 Data Cache: Write Streaming Mode; Last Level Cache: Access (due to read); Last Level Cache: Miss (due to read); Memory: Access; Memory: Access (due to read); Memory: Access (due to write); Multi-socket Remote Access: Access (due to read); Stalls: Backend; Stalls: Backend (Interlock); Stalls: Backend (Interlock, AGU); Stalls: Backend (Interlock, FPU); Stalls: Backend (Interlock, Load); Stalls: Backend (Interlock, Load, Cache-miss); Stalls: Backend (Interlock, Load, TLB-miss); Stalls: Backend (Interlock, Store); Stalls: Backend (Interlock, Store, STB full); Stalls: Backend (Interlock, Store, TLB-miss); Stalls: Frontend; Stalls: Frontend (Cache miss); Stalls: Frontend (Pre-decode error); Stalls: Frontend (TLB miss)
Linux
CPU Activity: System (Cortex-A55); CPU Activity: System (Other); CPU Activity: User (Cortex-A55); CPU Activity: User (Other); CPU Contention: Wait; Memory: Buffer; Memory: Cached; Memory: Free; Memory: Slab; Memory: Used
Mali Job Manager
Mali GPU Cycles: Fragment queue active; Mali GPU Cycles: GPU active; Mali GPU Cycles: Non-fragment queue active; Mali GPU Tasks: Fragment tasks
Mali Memory System
Mali External Bus Accesses: Read transaction; Mali External Bus Accesses: Write transaction; Mali External Bus Beats: Read beat; Mali External Bus Beats: Write beat; Mali External Bus Read Latency: 0-127 cycles; Mali External Bus Read Latency: 128-191 cycles; Mali External Bus Read Latency: 192-255 cycles; Mali External Bus Read Latency: 256-319 cycles; Mali External Bus Read Latency: 320-383 cycles; Mali External Bus Stalls: Read stall cycles; Mali External Bus Stalls: Write stall cycles; Mali L2 Cache Lookups: Read lookup; Mali L2 Cache Lookups: Write lookup
Mali Shader Core
Mali Core Cycles: Execution core active; Mali Core Cycles: Fragment active; Mali Core Cycles: Fragment FPKB active; Mali Core Cycles: Non-fragment active; Mali Core External Reads: Fragment external read beats; Mali Core External Reads: Load/store external read beats; Mali Core External Reads: Texture external read beats; Mali Core Instructions: Diverged instructions; Mali Core Instructions: Executed instructions; Mali Core L2 Reads: Fragment L2 read beats; Mali Core L2 Reads: Load/store L2 read beats; Mali Core L2 Reads: Texture L2 read beats; Mali Core Load/Store Cycles: Atomic access cycles; Mali Core Load/Store Cycles: Full read cycles; Mali Core Load/Store Cycles: Full write cycles; Mali Core Load/Store Cycles: Partial read cycles; Mali Core Load/Store Cycles: Partial write cycles; Mali Core Primitives: Rasterized primitives; Mali Core Quads: Early ZS killed quads; Mali Core Quads: Early ZS tested quads; Mali Core Quads: Early ZS updated quads; Mali Core Quads: FPK occluder quads; Mali Core Quads: Late ZS killed quads; Mali Core Quads: Late ZS tested quads; Mali Core Quads: Rasterized fine quads; Mali Core Texture Cycles: Cache lookups; Mali Core Texture Cycles: Texturing active; Mali Core Texture Line Fetches: Compressed line fetches; Mali Core Texture Line Fetches: Line fetches; Mali Core Texture Quads: Descriptor misses; Mali Core Texture Quads: Mipmapped texture issues; Mali Core Texture Quads: Texture issues; Mali Core Texture Quads: Texture requests; Mali Core Texture Quads: Trilinear filtered issues; Mali Core Tiles: Tiles; Mali Core Tiles: Unchanged tiles killed; Mali Core Varying Cycles: 16-bit interpolation active; Mali Core Varying Cycles: 32-bit interpolation active; Mali Core Varying Requests: Interpolation requests; Mali Core Warps: All register warps; Mali Core Warps: Fragment warps; Mali Core Warps: Full quad warps; Mali Core Warps: Non-fragment warps; Mali Core Warps: Partial fragment warps; Mali Core Writes: Load/store other write beats; Mali Core Writes: Load/store writeback write beats; Mali Core Writes: Tile buffer write beats
Mali Tiler
Mali Input Primitives: Line primitives; Mali Input Primitives: Point primitives; Mali Input Primitives: Triangle primitives; Mali Primitive Culling: Facing and XY plane test culled primitives; Mali Primitive Culling: Sample test culled primitives; Mali Primitive Culling: Visible primitives; Mali Primitive Culling: Z plane test culled primitives; Mali Tiler Shading Requests: Position shading requests; Mali Tiler Shading Requests: Varying shading requests
Other - [6 of 6 counters available]
Branch Predictor: Mispredictions; Branch Predictor: Possible Predictions; Bus: Access; Cycles: Bus Cycles; Cycles: CPU Cycles; Errors: Memory; Exceptions: Taken; Instructions (Executed): All; Instructions (Executed): Branch (Immediate); Instructions (Executed): Branch (Return); Instructions (Executed): Exception Returns; Instructions (Executed): Increment PMSWINC Register; Instructions (Executed): Load; Instructions (Executed): Store; Instructions (Executed): Unaligned Load/Store; Instructions (Executed): Write to CONTEXTIDR; Instructions (Executed): Write to PC; Instructions (Executed): Write to TTBR; Instructions (Speculated): All; L1 Data Cache: Access; L1 Data Cache: Refill; L1 Data Cache: Write-back; L1 Data TLB: Refill; L1 Instruction Cache: Access; L1 Instruction Cache: Refill; L1 Instruction TLB: Refill; L2 Data Cache: Access; L2 Data Cache: Refill; L2 Data Cache: Write-back; Memory: Access
Perf Software
Alignment Faults: Faults; Clock: CPU Clock; Clock: Task Clock; Emulation Faults: Faults; Page Faults: Faults; Page Faults: Major Faults; Page Faults: Minor Faults; Process: Context Switches; Process: CPU Migrations
Xiaomi Mi 10 (8 cores, Adreno (TM) 650, 8 GB)
For more performance counters, see:
Cortex-A77 - [1 of 6 counters available]
Branch Predictor: Mispredictions; Branch Predictor: Possible Predictions; Bus: Access; Bus: Access (due to read); Bus: Access (due to write); Cycles: Bus Cycles; Cycles: CPU Cycles; Data TLB: Translation table walk; Errors: Memory; Exceptions: Data Abort; Exceptions: FIQ; Exceptions: HVC; Exceptions: Instruction Abort; Exceptions: IRQ; Exceptions: SMC; Exceptions: SVC; Exceptions: Taken; Exceptions: Trap (Data Abort); Exceptions: Trap (FIQ); Exceptions: Trap (Instruction Abort); Exceptions: Trap (IRQ); Exceptions: Trap (Other); Exceptions: Undefined; Instruction TLB: Translation table walk; Instructions (Executed): All; Instructions (Executed): Branch (Any); Instructions (Executed): Branch (Mispredicted); Instructions (Executed): Exception Returns; Instructions (Executed): Increment PMSWINC Register; Instructions (Executed): Write to CONTEXTIDR; Instructions (Executed): Write to TTBR; Instructions (Speculated): All; Instructions (Speculated): Barrier (DMB); Instructions (Speculated): Barrier (DSB); Instructions (Speculated): Barrier (ISB); Instructions (Speculated): Branch (immediate); Instructions (Speculated): Branch (indirect); Instructions (Speculated): Branch (return); Instructions (Speculated): Branch (software PC writes); Instructions (Speculated): Crypto; Instructions (Speculated): Data Processing (Advanced SIMD); Instructions (Speculated): Data Processing (Floating-point); Instructions (Speculated): Data Processing (Integer); Instructions (Speculated): Load; Instructions (Speculated): Load (Acquire); Instructions (Speculated): Load-Exclusive; Instructions (Speculated): Load/Store; Instructions (Speculated): Store; Instructions (Speculated): Store (Release); Instructions (Speculated): Store-Exclusive; Instructions (Speculated): Store-Exclusive (Failures); Instructions (Speculated): Store-Exclusive (Successes); L1 Data Cache: Access; L1 Data Cache: Access (due to read); L1 Data Cache: Access (due to write); L1 Data Cache: Invalidation; L1 Data Cache: Refill; L1 Data Cache: Refill (due to read); L1 Data Cache: Refill (due to write); L1 Data Cache: Refill (from inside cluster); L1 Data Cache: Refill (from outside cluster); L1 Data Cache: Write-back; L1 Data Cache: Write-back (due to clean); L1 Data Cache: Write-back (due to reuse); L1 Data TLB: Access; L1 Data TLB: Access (due to read); L1 Data TLB: Access (due to write); L1 Data TLB: Refill; L1 Data TLB: Refill (due to read); L1 Data TLB: Refill (due to write); L1 Instruction Cache: Access; L1 Instruction Cache: Refill; L1 Instruction TLB: Access; L1 Instruction TLB: Refill; L2 Data Cache: Access; L2 Data Cache: Access (due to read); L2 Data Cache: Access (due to write); L2 Data Cache: Allocation without refill; L2 Data Cache: Invalidation; L2 Data Cache: Refill; L2 Data Cache: Refill (due to read); L2 Data Cache: Refill (due to write); L2 Data Cache: Write-back; L2 Data Cache: Write-back (due to clean); L2 Data Cache: Write-back (due to reuse); L2 Data/Unified TLB: Access; L2 Data/Unified TLB: Access (due to read); L2 Data/Unified TLB: Access (due to write); L2 Data/Unified TLB: Refill; L2 Data/Unified TLB: Refill (due to read); L2 Data/Unified TLB: Refill (due to write); L3 Data Cache: Access; L3 Data Cache: Access (due to read); L3 Data Cache: Allocation without refill; L3 Data Cache: Refill; Last Level Cache: Access (due to read); Last Level Cache: Miss (due to read); Memory: Access; Memory: Access (due to read); Memory: Access (due to unaligned read or write); Memory: Access (due to unaligned read); Memory: Access (due to unaligned write); Memory: Access (due to write); Multi-socket Remote Access: Access; Stalls: Backend; Stalls: Frontend
Kryo 460/485/495/585 Silver - [1 of 6 counters available]
Branch Predictor: Mispredictions; Branch Predictor: Possible Predictions; Bus: Access; Bus: Access (due to read); Bus: Access (due to write); Cycles: Bus Cycles; Cycles: CPU Cycles; Data TLB: Translation table walk; Errors: Memory; Errors: Pre-decode; Exceptions: FIQ; Exceptions: IRQ; Exceptions: Taken; Instruction TLB: Translation table walk; Instructions (Executed): All; Instructions (Executed): Branch (Any); Instructions (Executed): Branch (Conditional); Instructions (Executed): Branch (Conditional, Mispredicted); Instructions (Executed): Branch (Immediate); Instructions (Executed): Branch (Indirect, Address predicted); Instructions (Executed): Branch (Indirect, Mispredicted Address); Instructions (Executed): Branch (Indirect, Mispredicted); Instructions (Executed): Branch (Mispredicted); Instructions (Executed): Branch (Return); Instructions (Executed): Branch (Return, Address predicted); Instructions (Executed): Branch (Return, Mispredicted Address); Instructions (Executed): Exception Returns; Instructions (Executed): Increment PMSWINC Register; Instructions (Executed): Load; Instructions (Executed): Store; Instructions (Executed): Unaligned Load/Store; Instructions (Executed): Write to CONTEXTIDR; Instructions (Executed): Write to PC; Instructions (Executed): Write to TTBR; Instructions (Speculated): All; Instructions (Speculated): Branch (immediate); Instructions (Speculated): Branch (indirect); Instructions (Speculated): Branch (return); Instructions (Speculated): Branch (software PC writes); Instructions (Speculated): Crypto; Instructions (Speculated): Data Processing (Advanced SIMD); Instructions (Speculated): Data Processing (Floating-point); Instructions (Speculated): Data Processing (Integer); Instructions (Speculated): Load; Instructions (Speculated): Load/Store; Instructions (Speculated): Store; L1 Data Cache: Access; L1 Data Cache: Access (due to read); L1 Data Cache: Access (due to write); L1 Data Cache: Enter Write Streaming Mode; L1 Data Cache: Refill; L1 Data Cache: Refill (due to prefetch); L1 Data Cache: Refill (due to read); L1 Data Cache: Refill (due to write); L1 Data Cache: Refill (from inside cluster); L1 Data Cache: Refill (from outside cluster); L1 Data Cache: Write Streaming Mode; L1 Data Cache: Write-back; L1 Data TLB: Access; L1 Data TLB: Refill; L1 Instruction Cache: Access; L1 Instruction Cache: Refill; L1 Instruction TLB: Access; L1 Instruction TLB: Refill; L2 Data Cache: Access; L2 Data Cache: Access (due to read); L2 Data Cache: Access (due to write); L2 Data Cache: Allocation without refill; L2 Data Cache: Refill; L2 Data Cache: Refill (due to prefetch); L2 Data Cache: Refill (due to read); L2 Data Cache: Refill (due to write); L2 Data Cache: Stash Dropped; L2 Data Cache: Write Streaming Mode; L2 Data Cache: Write-back; L2 Data/Unified TLB: Access; L2 Data/Unified TLB: Access (IPA); L2 Data/Unified TLB: Access (Last Level Walk); L2 Data/Unified TLB: Access (Level 2 Walk); L2 Data/Unified TLB: Refill; L2 Data/Unified TLB: Refill (IPA); L2 Data/Unified TLB: Refill (Last Level Walk); L2 Data/Unified TLB: Refill (Level 2 Walk); L3 Data Cache: Access; L3 Data Cache: Access (due to read); L3 Data Cache: Allocation without refill; L3 Data Cache: Refill; L3 Data Cache: Refill (due to prefetch); L3 Data Cache: Refill (due to read); L3 Data Cache: Write Streaming Mode; Last Level Cache: Access (due to read); Last Level Cache: Miss (due to read); Memory: Access; Memory: Access (due to read); Memory: Access (due to write); Multi-socket Remote Access: Access (due to read); Stalls: Backend; Stalls: Backend (Interlock); Stalls: Backend (Interlock, AGU); Stalls: Backend (Interlock, FPU); Stalls: Backend (Interlock, Load); Stalls: Backend (Interlock, Load, Cache-miss); Stalls: Backend (Interlock, Load, TLB-miss); Stalls: Backend (Interlock, Store); Stalls: Backend (Interlock, Store, STB full); Stalls: Backend (Interlock, Store, TLB-miss); Stalls: Frontend; Stalls: Frontend (Cache miss); Stalls: Frontend (Pre-decode error); Stalls: Frontend (TLB miss)
Linux
CPU Activity: System (Cortex-A77); CPU Activity: System (Kryo 460/485/495/585 Silver); CPU Activity: User (Cortex-A77); CPU Activity: User (Kryo 460/485/495/585 Silver); CPU Contention: Wait; Memory: Buffer; Memory: Cached; Memory: Free; Memory: Slab; Memory: Used
Perf Software
Alignment Faults: Faults; Clock: CPU Clock; Clock: Task Clock; Emulation Faults: Faults; Page Faults: Faults; Page Faults: Major Faults; Page Faults: Minor Faults; Process: Context Switches; Process: CPU Migrations
Thermal Query
Android Thermal Throttling: Throttling State
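Connecting the phone
Starting a profile
Saving profile data
Save button (red box): saves the current profile data, then starts a new profile without killing the process.
Stop button (blue box): saves the current profile data, then kills the process.
Important: when the Save button (red box) is used, uam cannot collect data during a match.
Adding symbol tables to recorded performance data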
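Timeline view
Heat Map
Viewing all performance counters:
Viewing all threads of a process:
Selecting a point in time to view how each thread is performing at that moment: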
Core Map
Cluster Map
Samples
Processes
Call Paths
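Total: Samples (#/%): the number and percentage of CPU samples attributed to the function together with the functions it calls. Note: blocking operations such as Sleep or Wait suspend the thread rather than consume CPU, so they do not add samples. A function that takes a long time in wall-clock terms therefore does not necessarily have a high sample count.
Self: Samples (#/%): the number and percentage of CPU samples attributed to the function's own code.
If a function has 100 Samples, the profiler observed it executing at 100 sampling points during the capture; this is not the number of times it was called. A high sample count identifies functions that consume a lot of CPU time and are therefore likely performance bottlenecks.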
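To make the difference between the Self and Total columns concrete, here is a small sketch that aggregates sampled call stacks by hand. The stacks in `SAMPLES` and the function names are invented for illustration, and this is not how Streamline itself is implemented; it only mirrors the definitions given above.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical sampled call stacks, outermost frame first; one entry per sample.
SAMPLES: List[List[str]] = [
    ["main", "update", "physics"],
    ["main", "update", "physics"],
    ["main", "update"],
    ["main", "render"],
]


def aggregate(samples: List[List[str]]) -> Dict[str, Tuple[int, int, float]]:
    """Return {function: (self_samples, total_samples, total_percent)}."""
    self_counts: Counter = Counter()
    total_counts: Counter = Counter()
    for stack in samples:
        if not stack:
            continue
        # Self: only the innermost frame was actually executing at the sample.
        self_counts[stack[-1]] += 1
        # Total: every function on the stack counts once per sample.
        for fn in set(stack):
            total_counts[fn] += 1
    n = len(samples)
    return {
        fn: (self_counts[fn], total_counts[fn], 100.0 * total_counts[fn] / n)
        for fn in total_counts
    }


for name, (self_n, total_n, pct) in sorted(aggregate(SAMPLES).items()):
    print(f"{name}: self={self_n} total={total_n} ({pct:.0f}%)")
```

In this made-up data `main` has a Total of 4 but a Self of 0: it is on every stack, yet all the CPU time is spent in the functions it calls.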
Functions
Code
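On the Call Paths or Functions tab, select a function's stack frame and choose Select Code from the right-click menu to display that function's source code.
Click the toolbar button marked with the red box to show the function's disassembly.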
Log
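The entry shown in the figure is a Bookmark created on the Timeline during profiling; double-click it to jump to that Bookmark.
Viewing performance data for a time range
While recording, use the context menu item "Create Bookmark at ...m ...s" to insert a bookmark as a marker.
After recording, use the left and right rulers to select the region between bookmarks and view the performance data for that time range.
Further reading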
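The same idea can be applied to counter data once it is outside the tool. This is a rough sketch that assumes the capture has already been exported as (timestamp, value) pairs; the `SERIES` data, the bookmark times, and the helper names are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical (timestamp in seconds, counter value) pairs from a capture.
SERIES: List[Tuple[float, float]] = [
    (0.0, 10.0),
    (1.0, 20.0),
    (2.0, 40.0),
    (3.0, 60.0),
    (4.0, 30.0),
]


def select_range(
    series: List[Tuple[float, float]], start: float, end: float
) -> List[Tuple[float, float]]:
    """Keep the points between two markers, like the left and right rulers."""
    lo, hi = min(start, end), max(start, end)
    return [(t, v) for t, v in series if lo <= t <= hi]


def mean_value(points: List[Tuple[float, float]]) -> float:
    """Average counter value over the selected points (0.0 if none)."""
    return sum(v for _, v in points) / len(points) if points else 0.0


# Two bookmark positions, chosen arbitrarily for the example.
selected = select_range(SERIES, 1.0, 3.0)
print(len(selected), mean_value(selected))
```

Comparing the mean inside the bookmarked range with the mean of the whole series is a quick way to see whether the marked scene is heavier than the rest of the capture.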