可可西

Analyzing Android Performance with Streamline

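Arm Mobile Studio is a suite of performance optimization tools for analyzing the CPU and GPU behavior of apps on Android (no root required). It helps developers locate performance bottlenecks in their apps.

It consists of four tools: Performance Advisor, Streamline, Graphics Analyzer, and Mali Offline Compiler.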

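Tool name  Description
Performance Advisor  Command-line tool. Reads a Streamline capture file, produces an easy-to-read performance report, and offers optimization suggestions.
Streamline  Captures CPU, GPU, memory, and other performance data and displays it graphically in real time. Note: GPU data is available only on phones with a Mali GPU.
Graphics Analyzer  Debugs OpenGL ES or Vulkan graphics API usage and analyzes overdraw, shaders, textures, and so on. Note: requires a Mali GPU.
Mali Offline Compiler  Checks how shader code performs on Mali GPUs.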

 

 

It is available as a Starter Edition (free) and a Professional Edition (paid); see the edition comparison.

FEATURE  Starter Edition  Professional Edition

Run Arm Mobile Studio tools headlessly within your existing continuous integration systems  No  Yes
Generate machine-readable reports in JSON format  No  Yes
Access world-class support from Arm  No  Yes
Intuitive Performance Advisor reports pinpointing problem areas and providing profiling advice  Yes  Yes
Mali Offline Compiler shows performance and bottlenecks relating to shaders or kernels  Yes  Yes
Detailed application profiling with off-the-shelf mobile devices  Yes  Yes
Full support for all announced Arm 32-bit and 64-bit CPU architectures  Yes  Yes
Access to detailed CPU and GPU hardware counters  Yes  Yes
Frame-by-frame analysis of OpenGL ES and Vulkan content  Yes  Yes
Enhance your profiling experience with custom code annotations  Yes  Yes
Debug and profile VR Applications  Yes  Yes
License required for use  Free to use  Purchase required for additional features and use in a continuous integration system

Note: Download the Starter Edition (free) of Arm Mobile Studio from the Arm Developer website. The latest version at the time of writing is 2020.2; see the release history.

 

After installation, the Starter Edition (free) of Arm Mobile Studio contains the following files:

 

A Brief Introduction to Mali Offline Compiler

Vertex shader

Run the command malioc shader.vert; it prints the following compilation statistics:

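Note: On mobile (Mali) GPUs the vertex shader runs as two passes: the Position variant computes only the position, and the Varying variant computes the remaining, non-position outputs.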

 

Pixel (fragment) shader

Run the command malioc shader.frag; it prints the following compilation statistics:

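Note 1: If Stack spilling shows XXX bytes, XXX bytes have spilled to the stack. The root cause is a shortage of registers (for example, too many dynamic if branches), and it causes serious runtime performance problems: some variables are kept in memory instead of registers, so reading and writing them is very slow.

       In the Xcode Metal Frame Debugger, running out of registers is reported as a Register Spilling warning.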
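Note 1 can be turned into an automated check. The following is a minimal Python sketch that looks for a "Stack spilling" line in a saved report and returns the spilled byte count. The exact line format ("Stack spilling: false" or "Stack spilling: N bytes") is an assumption based on the description above, and the two sample reports are hypothetical, not real tool output.

```python
import re


def stack_spill_bytes(report: str) -> int:
    """Return the spilled byte count from a 'Stack spilling' line (0 if none).

    Assumes the line looks like 'Stack spilling: false' or
    'Stack spilling: 48 bytes'; adjust the patterns if your report differs.
    """
    match = re.search(r"^\s*Stack spilling:\s*(.+?)\s*$", report, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Stack spilling' line found")
    value = match.group(1)
    if value.lower() == "false":
        return 0
    number = re.match(r"(\d+)\s*bytes?$", value)
    if number is None:
        raise ValueError(f"unrecognized value: {value!r}")
    return int(number.group(1))


# Hypothetical report excerpts, written by hand for illustration.
CLEAN = "Work registers: 32\nStack spilling: false\n"
SPILLING = "Work registers: 64\nStack spilling: 48 bytes\n"

print(stack_spill_bytes(CLEAN))
print(stack_spill_bytes(SPILLING))
```

A check like this could fail a shader build whenever the returned value is non-zero.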

Note 2: When Uses late ZS test is false and Uses late ZS update is false, HSR (or Early-Z) is effective.

         When Uses late ZS test is false and Uses late ZS update is true, HSR (or Early-Z) is ineffective.
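The rule in Note 2 can be written as a small Python function. This is only a sketch of the two cases the note describes; the function name is hypothetical, and combinations the note does not cover return None rather than a guess.

```python
from typing import Optional


def early_zs_effective(uses_late_zs_test: bool,
                       uses_late_zs_update: bool) -> Optional[bool]:
    """Apply the rule from Note 2 to the two reported shader properties.

    Returns True or False for the two cases the note covers, and None
    when the late ZS test is in use, which the note does not describe.
    """
    if not uses_late_zs_test and not uses_late_zs_update:
        return True   # HSR / Early-Z is effective
    if not uses_late_zs_test and uses_late_zs_update:
        return False  # HSR / Early-Z is ineffective
    return None       # not covered by Note 2


print(early_zs_effective(False, False))
print(early_zs_effective(False, True))
```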

Reference: Optimizing Unity shaders with Mali Compiler

 

The rest of this article focuses on the Streamline performance analysis tool.

 

Mobile devices

Huawei Mate 30 (8 cores, Mali-G76, 8 GB)

 

 

Additional performance counters:

Cortex-A55 - [1 of 6 counters available]
Branch Predictor: Mispredictions
Branch Predictor: Possible Predictions
Bus: Access
Bus: Access (due to read)
Bus: Access (due to write)
Cycles: Bus Cycles
Cycles: CPU Cycles
Data TLB: Translation table walk
Errors: Memory
Errors: Pre-decode
Exceptions: FIQ
Exceptions: IRQ
Exceptions: Taken
Instruction TLB: Translation table walk
Instructions (Executed): All
Instructions (Executed): Branch (Any)
Instructions (Executed): Branch (Conditional)
Instructions (Executed): Branch (Conditional, Mispredicted)
Instructions (Executed): Branch (Immediate)
Instructions (Executed): Branch (Indirect, Address predicted)
Instructions (Executed): Branch (Indirect, Mispredicted Address)
Instructions (Executed): Branch (Indirect, Mispredicted)
Instructions (Executed): Branch (Mispredicted)
Instructions (Executed): Branch (Return)
Instructions (Executed): Branch (Return, Address predicted)
Instructions (Executed): Branch (Return, Mispredicted Address)
Instructions (Executed): Exception Returns
Instructions (Executed): Increment PMSWINC Register
Instructions (Executed): Load
Instructions (Executed): Store
Instructions (Executed): Unaligned Load/Store
Instructions (Executed): Write to CONTEXTIDR
Instructions (Executed): Write to PC
Instructions (Executed): Write to TTBR
Instructions (Speculated): All
Instructions (Speculated): Branch (immediate)
Instructions (Speculated): Branch (indirect)
Instructions (Speculated): Branch (return)
Instructions (Speculated): Branch (software PC writes)
Instructions (Speculated): Crypto
Instructions (Speculated): Data Processing (Advanced SIMD)
Instructions (Speculated): Data Processing (Floating-point)
Instructions (Speculated): Data Processing (Integer)
Instructions (Speculated): Load
Instructions (Speculated): Load/Store
Instructions (Speculated): Store
L1 Data Cache: Access
L1 Data Cache: Access (due to read)
L1 Data Cache: Access (due to write)
L1 Data Cache: Enter Write Streaming Mode
L1 Data Cache: Refill
L1 Data Cache: Refill (due to prefetch)
L1 Data Cache: Refill (due to read)
L1 Data Cache: Refill (due to write)
L1 Data Cache: Refill (from inside cluster)
L1 Data Cache: Refill (from outside cluster)
L1 Data Cache: Write Streaming Mode
L1 Data Cache: Write-back
L1 Data TLB: Access
L1 Data TLB: Refill
L1 Instruction Cache: Access
L1 Instruction Cache: Refill
L1 Instruction TLB: Access
L1 Instruction TLB: Refill
L2 Data Cache: Access
L2 Data Cache: Access (due to read)
L2 Data Cache: Access (due to write)
L2 Data Cache: Allocation without refill
L2 Data Cache: Refill
L2 Data Cache: Refill (due to prefetch)
L2 Data Cache: Refill (due to read)
L2 Data Cache: Refill (due to write)
L2 Data Cache: Stash Dropped
L2 Data Cache: Write Streaming Mode
L2 Data Cache: Write-back
L2 Data/Unified TLB: Access
L2 Data/Unified TLB: Access (IPA)
L2 Data/Unified TLB: Access (Last Level Walk)
L2 Data/Unified TLB: Access (Level 2 Walk)
L2 Data/Unified TLB: Refill
L2 Data/Unified TLB: Refill (IPA)
L2 Data/Unified TLB: Refill (Last Level Walk)
L2 Data/Unified TLB: Refill (Level 2 Walk)
L3 Data Cache: Access
L3 Data Cache: Access (due to read)
L3 Data Cache: Allocation without refill
L3 Data Cache: Refill
L3 Data Cache: Refill (due to prefetch)
L3 Data Cache: Refill (due to read)
L3 Data Cache: Write Streaming Mode
Last Level Cache: Access (due to read)
Last Level Cache: Miss (due to read)
Memory: Access
Memory: Access (due to read)
Memory: Access (due to write)
Multi-socket Remote Access: Access (due to read)
Stalls: Backend
Stalls: Backend (Interlock)
Stalls: Backend (Interlock, AGU)
Stalls: Backend (Interlock, FPU)
Stalls: Backend (Interlock, Load)
Stalls: Backend (Interlock, Load, Cache-miss)
Stalls: Backend (Interlock, Load, TLB-miss)
Stalls: Backend (Interlock, Store)
Stalls: Backend (Interlock, Store, STB full)
Stalls: Backend (Interlock, Store, TLB-miss)
Stalls: Frontend
Stalls: Frontend (Cache miss)
Stalls: Frontend (Pre-decode error)
Stalls: Frontend (TLB miss)

Linux
CPU Activity: System (Cortex-A55)
CPU Activity: System (Other)
CPU Activity: User (Cortex-A55)
CPU Activity: User (Other)
CPU Contention: Wait
Memory: Buffer
Memory: Cached
Memory: Free
Memory: Slab
Memory: Used

Mali Job Manager
Mali GPU Cycles: Fragment queue active
Mali GPU Cycles: GPU active
Mali GPU Cycles: Non-fragment queue active
Mali GPU Tasks: Fragment tasks

Mali Memory System
Mali External Bus Accesses: Read transaction
Mali External Bus Accesses: Write transaction
Mali External Bus Beats: Read beat
Mali External Bus Beats: Write beat
Mali External Bus Read Latency: 0-127 cycles
Mali External Bus Read Latency: 128-191 cycles
Mali External Bus Read Latency: 192-255 cycles
Mali External Bus Read Latency: 256-319 cycles
Mali External Bus Read Latency: 320-383 cycles
Mali External Bus Stalls: Read stall cycles
Mali External Bus Stalls: Write stall cycles
Mali L2 Cache Lookups: Read lookup
Mali L2 Cache Lookups: Write lookup

Mali Shader Core
Mali Core Cycles: Execution core active
Mali Core Cycles: Fragment active
Mali Core Cycles: Fragment FPKB active
Mali Core Cycles: Non-fragment active
Mali Core External Reads: Fragment external read beats
Mali Core External Reads: Load/store external read beats
Mali Core External Reads: Texture external read beats
Mali Core Instructions: Diverged instructions
Mali Core Instructions: Executed instructions
Mali Core L2 Reads: Fragment L2 read beats
Mali Core L2 Reads: Load/store L2 read beats
Mali Core L2 Reads: Texture L2 read beats
Mali Core Load/Store Cycles: Atomic access cycles
Mali Core Load/Store Cycles: Full read cycles
Mali Core Load/Store Cycles: Full write cycles
Mali Core Load/Store Cycles: Partial read cycles
Mali Core Load/Store Cycles: Partial write cycles
Mali Core Primitives: Rasterized primitives
Mali Core Quads: Early ZS killed quads
Mali Core Quads: Early ZS tested quads
Mali Core Quads: Early ZS updated quads
Mali Core Quads: FPK occluder quads
Mali Core Quads: Late ZS killed quads
Mali Core Quads: Late ZS tested quads
Mali Core Quads: Rasterized fine quads
Mali Core Texture Cycles: Cache lookups
Mali Core Texture Cycles: Texturing active
Mali Core Texture Line Fetches: Compressed line fetches
Mali Core Texture Line Fetches: Line fetches
Mali Core Texture Quads: Descriptor misses
Mali Core Texture Quads: Mipmapped texture issues
Mali Core Texture Quads: Texture issues
Mali Core Texture Quads: Texture requests
Mali Core Texture Quads: Trilinear filtered issues
Mali Core Tiles: Tiles
Mali Core Tiles: Unchanged tiles killed
Mali Core Varying Cycles: 16-bit interpolation active
Mali Core Varying Cycles: 32-bit interpolation active
Mali Core Varying Requests: Interpolation requests
Mali Core Warps: All register warps
Mali Core Warps: Fragment warps
Mali Core Warps: Full quad warps
Mali Core Warps: Non-fragment warps
Mali Core Warps: Partial fragment warps
Mali Core Writes: Load/store other write beats
Mali Core Writes: Load/store writeback write beats
Mali Core Writes: Tile buffer write beats

Mali Tiler
Mali Input Primitives: Line primitives
Mali Input Primitives: Point primitives
Mali Input Primitives: Triangle primitives
Mali Primitive Culling: Facing and XY plane test culled primitives
Mali Primitive Culling: Sample test culled primitives
Mali Primitive Culling: Visible primitives
Mali Primitive Culling: Z plane test culled primitives
Mali Tiler Shading Requests: Position shading requests
Mali Tiler Shading Requests: Varying shading requests

Other - [6 of 6 counters available]
Branch Predictor: Mispredictions
Branch Predictor: Possible Predictions
Bus: Access
Cycles: Bus Cycles
Cycles: CPU Cycles
Errors: Memory
Exceptions: Taken
Instructions (Executed): All
Instructions (Executed): Branch (Immediate)
Instructions (Executed): Branch (Return)
Instructions (Executed): Exception Returns
Instructions (Executed): Increment PMSWINC Register
Instructions (Executed): Load
Instructions (Executed): Store
Instructions (Executed): Unaligned Load/Store
Instructions (Executed): Write to CONTEXTIDR
Instructions (Executed): Write to PC
Instructions (Executed): Write to TTBR
Instructions (Speculated): All
L1 Data Cache: Access
L1 Data Cache: Refill
L1 Data Cache: Write-back
L1 Data TLB: Refill
L1 Instruction Cache: Access
L1 Instruction Cache: Refill
L1 Instruction TLB: Refill
L2 Data Cache: Access
L2 Data Cache: Refill
L2 Data Cache: Write-back
Memory: Access

Perf Software
Alignment Faults: Faults
Clock: CPU Clock
Clock: Task Clock
Emulation Faults: Faults
Page Faults: Faults
Page Faults: Major Faults
Page Faults: Minor Faults
Process: Context Switches
Process: CPU Migrations
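
The counters listed above are raw event counts, and ratios between related counters are often easier to interpret. The Python sketch below derives three common ratios from counter names that appear in the list. The formulas are the usual textbook definitions rather than anything Streamline is confirmed to compute, and the sample values are made up.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(counters: dict) -> dict:
    """Combine raw counter totals into a few commonly used ratios."""
    return {
        # Share of predictable branches that were mispredicted.
        "branch_mispredict_rate": ratio(
            counters["Branch Predictor: Mispredictions"],
            counters["Branch Predictor: Possible Predictions"],
        ),
        # L1 data cache refills per access (an approximate miss ratio).
        "l1d_refill_ratio": ratio(
            counters["L1 Data Cache: Refill"],
            counters["L1 Data Cache: Access"],
        ),
        # Instructions executed per CPU cycle.
        "ipc": ratio(
            counters["Instructions (Executed): All"],
            counters["Cycles: CPU Cycles"],
        ),
    }


# Made-up totals for one capture window, for illustration only.
SAMPLE = {
    "Branch Predictor: Mispredictions": 50,
    "Branch Predictor: Possible Predictions": 1000,
    "L1 Data Cache: Refill": 20,
    "L1 Data Cache: Access": 800,
    "Instructions (Executed): All": 3000,
    "Cycles: CPU Cycles": 2000,
}

metrics = derived_metrics(SAMPLE)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```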

 

Xiaomi Mi 10 (8 cores, Adreno (TM) 650, 8 GB)

 

 

 

Additional performance counters:

Cortex-A77 - [1 of 6 counters available]
Branch Predictor: Mispredictions
Branch Predictor: Possible Predictions
Bus: Access
Bus: Access (due to read)
Bus: Access (due to write)
Cycles: Bus Cycles
Cycles: CPU Cycles
Data TLB: Translation table walk
Errors: Memory
Exceptions: Data Abort
Exceptions: FIQ
Exceptions: HVC
Exceptions: Instruction Abort
Exceptions: IRQ
Exceptions: SMC
Exceptions: SVC
Exceptions: Taken
Exceptions: Trap (Data Abort)
Exceptions: Trap (FIQ)
Exceptions: Trap (Instruction Abort)
Exceptions: Trap (IRQ)
Exceptions: Trap (Other)
Exceptions: Undefined
Instruction TLB: Translation table walk
Instructions (Executed): All
Instructions (Executed): Branch (Any)
Instructions (Executed): Branch (Mispredicted)
Instructions (Executed): Exception Returns
Instructions (Executed): Increment PMSWINC Register
Instructions (Executed): Write to CONTEXTIDR
Instructions (Executed): Write to TTBR
Instructions (Speculated): All
Instructions (Speculated): Barrier (DMB)
Instructions (Speculated): Barrier (DSB)
Instructions (Speculated): Barrier (ISB)
Instructions (Speculated): Branch (immediate)
Instructions (Speculated): Branch (indirect)
Instructions (Speculated): Branch (return)
Instructions (Speculated): Branch (software PC writes)
Instructions (Speculated): Crypto
Instructions (Speculated): Data Processing (Advanced SIMD)
Instructions (Speculated): Data Processing (Floating-point)
Instructions (Speculated): Data Processing (Integer)
Instructions (Speculated): Load
Instructions (Speculated): Load (Acquire)
Instructions (Speculated): Load-Exclusive
Instructions (Speculated): Load/Store
Instructions (Speculated): Store
Instructions (Speculated): Store (Release)
Instructions (Speculated): Store-Exclusive
Instructions (Speculated): Store-Exclusive (Failures)
Instructions (Speculated): Store-Exclusive (Successes)
L1 Data Cache: Access
L1 Data Cache: Access (due to read)
L1 Data Cache: Access (due to write)
L1 Data Cache: Invalidation
L1 Data Cache: Refill
L1 Data Cache: Refill (due to read)
L1 Data Cache: Refill (due to write)
L1 Data Cache: Refill (from inside cluster)
L1 Data Cache: Refill (from outside cluster)
L1 Data Cache: Write-back
L1 Data Cache: Write-back (due to clean)
L1 Data Cache: Write-back (due to reuse)
L1 Data TLB: Access
L1 Data TLB: Access (due to read)
L1 Data TLB: Access (due to write)
L1 Data TLB: Refill
L1 Data TLB: Refill (due to read)
L1 Data TLB: Refill (due to write)
L1 Instruction Cache: Access
L1 Instruction Cache: Refill
L1 Instruction TLB: Access
L1 Instruction TLB: Refill
L2 Data Cache: Access
L2 Data Cache: Access (due to read)
L2 Data Cache: Access (due to write)
L2 Data Cache: Allocation without refill
L2 Data Cache: Invalidation
L2 Data Cache: Refill
L2 Data Cache: Refill (due to read)
L2 Data Cache: Refill (due to write)
L2 Data Cache: Write-back
L2 Data Cache: Write-back (due to clean)
L2 Data Cache: Write-back (due to reuse)
L2 Data/Unified TLB: Access
L2 Data/Unified TLB: Access (due to read)
L2 Data/Unified TLB: Access (due to write)
L2 Data/Unified TLB: Refill
L2 Data/Unified TLB: Refill (due to read)
L2 Data/Unified TLB: Refill (due to write)
L3 Data Cache: Access
L3 Data Cache: Access (due to read)
L3 Data Cache: Allocation without refill
L3 Data Cache: Refill
Last Level Cache: Access (due to read)
Last Level Cache: Miss (due to read)
Memory: Access
Memory: Access (due to read)
Memory: Access (due to unaligned read or write)
Memory: Access (due to unaligned read)
Memory: Access (due to unaligned write)
Memory: Access (due to write)
Multi-socket Remote Access: Access
Stalls: Backend
Stalls: Frontend

Kryo 460/485/495/585 Silver - [1 of 6 counters available]
Branch Predictor: Mispredictions
Branch Predictor: Possible Predictions
Bus: Access
Bus: Access (due to read)
Bus: Access (due to write)
Cycles: Bus Cycles
Cycles: CPU Cycles
Data TLB: Translation table walk
Errors: Memory
Errors: Pre-decode
Exceptions: FIQ
Exceptions: IRQ
Exceptions: Taken
Instruction TLB: Translation table walk
Instructions (Executed): All
Instructions (Executed): Branch (Any)
Instructions (Executed): Branch (Conditional)
Instructions (Executed): Branch (Conditional, Mispredicted)
Instructions (Executed): Branch (Immediate)
Instructions (Executed): Branch (Indirect, Address predicted)
Instructions (Executed): Branch (Indirect, Mispredicted Address)
Instructions (Executed): Branch (Indirect, Mispredicted)
Instructions (Executed): Branch (Mispredicted)
Instructions (Executed): Branch (Return)
Instructions (Executed): Branch (Return, Address predicted)
Instructions (Executed): Branch (Return, Mispredicted Address)
Instructions (Executed): Exception Returns
Instructions (Executed): Increment PMSWINC Register
Instructions (Executed): Load
Instructions (Executed): Store
Instructions (Executed): Unaligned Load/Store
Instructions (Executed): Write to CONTEXTIDR
Instructions (Executed): Write to PC
Instructions (Executed): Write to TTBR
Instructions (Speculated): All
Instructions (Speculated): Branch (immediate)
Instructions (Speculated): Branch (indirect)
Instructions (Speculated): Branch (return)
Instructions (Speculated): Branch (software PC writes)
Instructions (Speculated): Crypto
Instructions (Speculated): Data Processing (Advanced SIMD)
Instructions (Speculated): Data Processing (Floating-point)
Instructions (Speculated): Data Processing (Integer)
Instructions (Speculated): Load
Instructions (Speculated): Load/Store
Instructions (Speculated): Store
L1 Data Cache: Access
L1 Data Cache: Access (due to read)
L1 Data Cache: Access (due to write)
L1 Data Cache: Enter Write Streaming Mode
L1 Data Cache: Refill
L1 Data Cache: Refill (due to prefetch)
L1 Data Cache: Refill (due to read)
L1 Data Cache: Refill (due to write)
L1 Data Cache: Refill (from inside cluster)
L1 Data Cache: Refill (from outside cluster)
L1 Data Cache: Write Streaming Mode
L1 Data Cache: Write-back
L1 Data TLB: Access
L1 Data TLB: Refill
L1 Instruction Cache: Access
L1 Instruction Cache: Refill
L1 Instruction TLB: Access
L1 Instruction TLB: Refill
L2 Data Cache: Access
L2 Data Cache: Access (due to read)
L2 Data Cache: Access (due to write)
L2 Data Cache: Allocation without refill
L2 Data Cache: Refill
L2 Data Cache: Refill (due to prefetch)
L2 Data Cache: Refill (due to read)
L2 Data Cache: Refill (due to write)
L2 Data Cache: Stash Dropped
L2 Data Cache: Write Streaming Mode
L2 Data Cache: Write-back
L2 Data/Unified TLB: Access
L2 Data/Unified TLB: Access (IPA)
L2 Data/Unified TLB: Access (Last Level Walk)
L2 Data/Unified TLB: Access (Level 2 Walk)
L2 Data/Unified TLB: Refill
L2 Data/Unified TLB: Refill (IPA)
L2 Data/Unified TLB: Refill (Last Level Walk)
L2 Data/Unified TLB: Refill (Level 2 Walk)
L3 Data Cache: Access
L3 Data Cache: Access (due to read)
L3 Data Cache: Allocation without refill
L3 Data Cache: Refill
L3 Data Cache: Refill (due to prefetch)
L3 Data Cache: Refill (due to read)
L3 Data Cache: Write Streaming Mode
Last Level Cache: Access (due to read)
Last Level Cache: Miss (due to read)
Memory: Access
Memory: Access (due to read)
Memory: Access (due to write)
Multi-socket Remote Access: Access (due to read)
Stalls: Backend
Stalls: Backend (Interlock)
Stalls: Backend (Interlock, AGU)
Stalls: Backend (Interlock, FPU)
Stalls: Backend (Interlock, Load)
Stalls: Backend (Interlock, Load, Cache-miss)
Stalls: Backend (Interlock, Load, TLB-miss)
Stalls: Backend (Interlock, Store)
Stalls: Backend (Interlock, Store, STB full)
Stalls: Backend (Interlock, Store, TLB-miss)
Stalls: Frontend
Stalls: Frontend (Cache miss)
Stalls: Frontend (Pre-decode error)
Stalls: Frontend (TLB miss)

Linux
CPU Activity: System (Cortex-A77)
CPU Activity: System (Kryo 460/485/495/585 Silver)
CPU Activity: User (Cortex-A77)
CPU Activity: User (Kryo 460/485/495/585 Silver)
CPU Contention: Wait
Memory: Buffer
Memory: Cached
Memory: Free
Memory: Slab
Memory: Used

Perf Software
Alignment Faults: Faults
Clock: CPU Clock
Clock: Task Clock
Emulation Faults: Faults
Page Faults: Faults
Page Faults: Major Faults
Page Faults: Minor Faults
Process: Context Switches
Process: CPU Migrations

Thermal Query Android Thermal Throttling: Throttling State

 

Connecting a mobile device

 

Starting a profile

 

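Saving profile data

Save button (red box): saves the current profile data, then starts a new profile without killing the process.

Stop button (blue box): saves the current profile data, then kills the process.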

 

Important: when the Save button (red box) is used, uam cannot collect data while in a match.

 

Adding symbols to recorded performance data

 

Timeline view

Heat Map

View all performance counters:

 

View all threads of a process:

 

Select a point in time to see how each thread was performing at that moment:

 

Core Map

 

Cluster Map

 

 

Samples

 

 

Processes

 

 

Call Paths

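Total: Samples (#/%): the number and percentage of CPU counter samples attributed to the function and the functions it calls.   Note: blocking operations such as Sleep and Wait suspend the thread, so they do not add to the sample count. A function that takes a long time to run therefore does not necessarily have a high sample count.

Self: Samples (#/%): the number and percentage of CPU counter samples attributed to the function's own code.

If a function has 100 samples, the profiler observed that function executing at 100 sampling points during the capture; it does not mean the function was called 100 times. Sample counts help identify the functions that consume the most CPU time, which are likely performance bottlenecks.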
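
The difference between Self and Total samples can be illustrated with a small Python sketch. It is a simplified model, not Streamline's actual implementation: each sample is a call stack listed from the outermost caller to the function that was executing, and the function names and stacks below are invented.

```python
from collections import Counter


def self_and_total(stacks):
    """Return (self_counts, total_counts) for a list of sampled call stacks.

    Each stack lists function names from the outermost caller to the
    function that was executing when the sample was taken.
    """
    self_counts, total_counts = Counter(), Counter()
    for stack in stacks:
        if not stack:
            continue
        self_counts[stack[-1]] += 1      # innermost frame was running
        for function in set(stack):      # count each function once per sample
            total_counts[function] += 1
    return self_counts, total_counts


# Invented samples for illustration.
SAMPLES = [
    ["main", "update", "physics"],
    ["main", "update", "physics"],
    ["main", "update"],
    ["main", "render"],
]

self_counts, total_counts = self_and_total(SAMPLES)
for function in sorted(total_counts, key=total_counts.get, reverse=True):
    total = total_counts[function]
    own = self_counts[function]
    print(f"{function}: total={total} ({100 * total / len(SAMPLES):.0f}%), self={own}")
```

Here main appears in every sample, so its Total is 100%, but its Self count is zero because the time was spent in its callees.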

 

Functions

 

Code

 

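On the Call Paths or Functions tab, select a function's stack frame and choose Select Code from the right-click menu to display the function's source code.

Click the toolbar button marked with the red box to show the function's disassembly.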

 

Log

 

The entry shown in the figure is a Bookmark created on the Timeline during profiling; double-click it to jump to that Bookmark.

 

Viewing performance data for a time range

 

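While recording, use the context menu item "Create Bookmark at ...m ...s" to insert a bookmark as a marker.

After recording, use the left and right calipers to select a region based on the bookmark positions and view the performance data for that period.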
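
Conceptually, selecting a region between two bookmarks filters the samples to a time window. The Python sketch below shows that idea on exported-style data; the bookmark names, timestamps, and sample tuples are hypothetical and do not reflect any Streamline file format.

```python
from collections import Counter


def functions_in_range(samples, start_s, end_s):
    """Count samples per function with start_s <= timestamp <= end_s.

    samples is an iterable of (timestamp_seconds, function_name) pairs.
    """
    return Counter(name for t, name in samples if start_s <= t <= end_s)


# Hypothetical bookmarks (seconds from capture start) and samples.
BOOKMARKS = {"battle start": 12.0, "battle end": 15.0}
SAMPLES = [
    (11.5, "load"),
    (12.0, "update"),
    (13.2, "render"),
    (14.9, "update"),
    (15.5, "unload"),
]

result = functions_in_range(
    SAMPLES, BOOKMARKS["battle start"], BOOKMARKS["battle end"]
)
print(result.most_common())
```

Only the samples that fall between the two bookmarks are counted, which mirrors what the caliper selection does in the Timeline.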

 

Further reading

ARM Mobile Studio Performance Optimization (Part 1)

ARM Mobile Studio Performance Optimization (Part 2)

ARM Mobile Studio Performance Optimization (Part 3)

 

posted on 2022-07-11 21:53  可可西