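linux perf - a performance testing and optimization tool
Introduction to Perf
Perf is the system performance tuning tool that ships with the Linux kernel. Although its version number is still only 0.0.2, Perf is already powerful enough to rival OProfile, the profiler currently popular on Linux.
Perf's advantage lies in its tight integration with the Linux kernel: it can be the first to make use of new features merged into the kernel, whereas tools such as OProfile and GProf usually lag a step behind. The basic principle is similar to OProfile's: Perf gets and sets performance counters in the CPU's PMU registers to obtain information such as instructions executed, cache misses suffered, and branches mispredicted. The Linux kernel provides a layer of abstraction over these registers, so you can view sample information per process, per CPU, or per counter group.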
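Using Perf
The Perf workflow closely resembles OProfile's, so if you already know OProfile, Perf is easy to pick up. This section simply translates the examples from the Perf examples page in [1]. I will keep updating it as I discover more.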
$ perf record -f -- git gc
Counting objects: 1283571, done.
Compressing objects: 100% (206724/206724), done.
Writing objects: 100% (1283571/1283571), done.
Total 1283571 (delta 1070675), reused 1281443 (delta 1068566)
[ perf record: Captured and wrote 31.054 MB perf.data (~1356768 samples) ]
$ perf report --sort comm,dso,symbol | head -10
# Samples: 1355726
#
# Overhead          Command                            Shared Object  Symbol
# ........  ...............  .......................................  ......
#
    31.53%              git  /usr/bin/git                             [.] 0x0000000009804f
    13.41%        git-prune  /usr/bin/git-prune                       [.] 0x000000000ad06d
    10.05%              git  /lib/tls/i686/cmov/libc-2.8.90.so        [.] _nl_make_l10nflist
     5.36%        git-prune  /usr/lib/libz.so.1.2.3.3                 [.] 0x00000000009d51
     4.48%              git  /lib/tls/i686/cmov/libc-2.8.90.so        [.] memcpy
perf record is the counterpart of opcontrol --start, and perf report is the counterpart of opreport.
Perf use cases
Use 'perf list' to see all available counters:
titan:~> perf list
  [...]
  kmem:kmalloc                             [Tracepoint event]
  kmem:kmem_cache_alloc                    [Tracepoint event]
  kmem:kmalloc_node                        [Tracepoint event]
  kmem:kmem_cache_alloc_node               [Tracepoint event]
  kmem:kfree                               [Tracepoint event]
  kmem:kmem_cache_free                     [Tracepoint event]
  kmem:mm_page_free_direct                 [Tracepoint event]
  kmem:mm_pagevec_free                     [Tracepoint event]
  kmem:mm_page_alloc                       [Tracepoint event]
  kmem:mm_page_alloc_zone_locked           [Tracepoint event]
  kmem:mm_page_pcpu_drain                  [Tracepoint event]
  kmem:mm_page_alloc_extfrag               [Tracepoint event]
You can run your test program with any combination of the counters above. For example, the following command counts page allocations and frees while running hackbench:
titan:~> perf stat -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.575
 Performance counter stats for './hackbench 10':
          13857  kmem:mm_page_pcpu_drain
          27576  kmem:mm_page_alloc
           6025  kmem:mm_pagevec_free
          20934  kmem:mm_page_free_direct
    0.613972165  seconds time elapsed
Perf can also report how much the results fluctuate over N runs:
titan:~> perf stat --repeat 5 -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.627
Time: 0.644
Time: 0.564
Time: 0.559
Time: 0.626
 Performance counter stats for './hackbench 10' (5 runs):
          12920  kmem:mm_page_pcpu_drain    ( +-   3.359% )
          25035  kmem:mm_page_alloc         ( +-   3.783% )
           6104  kmem:mm_pagevec_free       ( +-   0.934% )
          18376  kmem:mm_page_free_direct   ( +-   4.941% )
    0.643954516  seconds time elapsed   ( +-   2.363% )
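The "+-" column summarizes run-to-run variation. As a rough illustration only, the Python sketch below takes the five hackbench "Time:" values printed above and computes their mean and relative standard deviation. It is not a reimplementation of perf's own "+-" calculation, whose exact formula is not given here, and the function name relative_spread is made up for this example.

```python
import statistics

# The five "Time:" values printed by hackbench in the run above.
times = [0.627, 0.644, 0.564, 0.559, 0.626]

def relative_spread(samples):
    """Return (mean, sample standard deviation as a percentage of the mean)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return mean, 100.0 * stdev / mean

mean, pct = relative_spread(times)
print(f"mean = {mean:.3f} s, spread = +- {pct:.2f}%")
```

The same helper can be fed any per-run metric you collect yourself, for example elapsed times from several separate perf stat invocations.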
With these statistics in hand, you can start sampling a tracepoint you care about (page allocations, for example):
titan:~/git> perf record -f -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.267 MB perf.data (~11679 samples) ]
To find out what triggered the page allocations:
titan:~/git> perf report
# Samples: 10646
#
# Overhead          Command               Shared Object
# ........  ...............  ..........................
#
    23.57%       git-repack  /lib64/libc-2.5.so
    21.81%              git  /lib64/libc-2.5.so
    14.59%              git  ./git
    11.79%       git-repack  ./git
     7.12%              git  /lib64/ld-2.5.so
     3.16%       git-repack  /lib64/libpthread-2.5.so
     2.09%       git-repack  /bin/bash
     1.97%               rm  /lib64/libc-2.5.so
     1.39%               mv  /lib64/ld-2.5.so
     1.37%               mv  /lib64/libc-2.5.so
     1.12%       git-repack  /lib64/ld-2.5.so
     0.95%               rm  /lib64/ld-2.5.so
     0.90%  git-update-serv  /lib64/libc-2.5.so
     0.73%  git-update-serv  /lib64/ld-2.5.so
     0.68%             perf  /lib64/libpthread-2.5.so
     0.64%       git-repack  /usr/lib64/libz.so.1.2.3
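The report is broken down by command and shared object, so the same library shows up in several rows. A minimal Python sketch, using only the first six rows copied from the report above, shows how the overhead could be added up per shared object. The helper name overhead_by_dso is hypothetical and is not part of perf.

```python
from collections import defaultdict

# First six rows copied from the report above: (overhead %, command, shared object).
rows = [
    (23.57, "git-repack", "/lib64/libc-2.5.so"),
    (21.81, "git", "/lib64/libc-2.5.so"),
    (14.59, "git", "./git"),
    (11.79, "git-repack", "./git"),
    (7.12, "git", "/lib64/ld-2.5.so"),
    (3.16, "git-repack", "/lib64/libpthread-2.5.so"),
]

def overhead_by_dso(rows):
    """Sum the overhead percentages of all rows that share a shared object."""
    totals = defaultdict(float)
    for overhead, _command, dso in rows:
        totals[dso] += overhead
    return dict(totals)

totals = overhead_by_dso(rows)
for dso, pct in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pct:6.2f}%  {dso}")
```

Summed this way, libc accounts for the largest share of the sampled page allocations in these rows, ahead of the git binary itself.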
For a more detailed view:
titan:~/git> perf report --sort comm,dso,symbol
# Samples: 10646
#
# Overhead          Command               Shared Object  Symbol
# ........  ...............  ..........................  ......
#
     9.35%       git-repack  ./git                       [.] insert_obj_hash
     9.12%              git  ./git                       [.] insert_obj_hash
     7.31%              git  /lib64/libc-2.5.so          [.] memcpy
     6.34%       git-repack  /lib64/libc-2.5.so          [.] _int_malloc
     6.24%       git-repack  /lib64/libc-2.5.so          [.] memcpy
     5.82%       git-repack  /lib64/libc-2.5.so          [.] __GI___fork
     5.47%              git  /lib64/libc-2.5.so          [.] _int_malloc
     2.99%              git  /lib64/libc-2.5.so          [.] memset
The call graph can be recorded as well, and it shows what percentage each function accounts for.
titan:~/git> perf record -f -g -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.963 MB perf.data (~42069 samples) ]
titan:~/git> perf report -g
# Samples: 10686
#
# Overhead          Command               Shared Object
# ........  ...............  ..........................
#
    23.25%       git-repack  /lib64/libc-2.5.so
                |
                |--50.00%-- _int_free
                |
                |--37.50%-- __GI___fork
                |          make_child
                |
                |--12.50%-- ptmalloc_unlock_all2
                |          make_child
                |
                 --6.25%-- __GI_strcpy
    21.61%              git  /lib64/libc-2.5.so
                |
                |--30.00%-- __GI_read
                |          |
                |           --83.33%-- git_config_from_file
                |                     git_config
                |                     |
   [...]
The following command counts page allocations across the whole system over 10 seconds:
titan:~/git> perf stat -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 10
 Performance counter stats for 'sleep 10':
         171585  kmem:mm_page_pcpu_drain
         322114  kmem:mm_page_alloc
          73623  kmem:mm_pagevec_free
         254115  kmem:mm_page_free_direct
   10.000591410  seconds time elapsed
The following command shows how system-wide page allocations fluctuate from one 1-second interval to the next:
titan:~/git> perf stat --repeat 10 -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 1
 Performance counter stats for 'sleep 1' (10 runs):
          17254  kmem:mm_page_pcpu_drain    ( +-   3.709% )
          34394  kmem:mm_page_alloc         ( +-   4.617% )
           7509  kmem:mm_pagevec_free       ( +-   4.820% )
          25653  kmem:mm_page_free_direct   ( +-   3.672% )
    1.058135029  seconds time elapsed   ( +-   3.089% )
Disassembly often helps pinpoint which line of code generated the instructions that cause the problem.
titan:~/git> perf annotate __GI___fork
------------------------------------------------
 Percent |      Source code & Disassembly of libc-2.5.so
------------------------------------------------
         :
         :
         :      Disassembly of section .plt:
         :      Disassembly of section .text:
         :
         :      00000031a2e95560 <__fork>:
 [...]
    0.00 :        31a2e95602:       b8 38 00 00 00          mov    $0x38,%eax
    0.00 :        31a2e95607:       0f 05                   syscall
   83.42 :        31a2e95609:       48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
    0.00 :        31a2e9560f:       0f 87 4d 01 00 00       ja     31a2e95762 <__fork+0x202>
    0.00 :        31a2e95615:       85 c0                   test   %eax,%eax
The result above shows that 83.42% of the samples in __GI___fork (page allocations, since the profile was recorded on kmem:mm_page_alloc) are attributed to the 0x38 system call it performs.
Is it worth optimizing a particular function?
You may want to know whether a particular function in your program is worth optimizing. A good example is the discussion on the git mailing list about optimizing the SHA1 hash algorithm, where perf can be used to estimate the payoff of an optimization in advance. See Linus's reply [2] for details.
"perf report --sort comm,dso,symbol" profiling shows the following for 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
    47.69%  git  /home/torvalds/git/git   [.] moz_SHA1_Update
    22.98%  git  /lib64/libz.so.1.2.3     [.] inflate_fast
     7.32%  git  /lib64/libc-2.10.1.so    [.] __GI_memcpy
     4.66%  git  /lib64/libz.so.1.2.3     [.] inflate
     3.76%  git  /lib64/libz.so.1.2.3     [.] adler32
     2.86%  git  /lib64/libz.so.1.2.3     [.] inflate_table
     2.41%  git  /home/torvalds/git/git   [.] lookup_object
     1.31%  git  /lib64/libc-2.10.1.so    [.] _int_malloc
     0.84%  git  /home/torvalds/git/git   [.] patch_delta
     0.78%  git  [kernel]                 [k] hpet_next_event
Clearly, the performance of the SHA1 hash algorithm is critical here.
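One way to turn such a profile into a rough prediction is Amdahl's law. The Python sketch below is an illustration under two assumptions that do not come from the original discussion: that the 47.69% overhead of moz_SHA1_Update approximates its share of total run time, and that a hypothetical optimization makes SHA1 exactly twice as fast.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    becomes `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

sha1_share = 0.4769  # moz_SHA1_Update share taken from the profile above

# Hypothetical case: the SHA1 routine becomes twice as fast.
print(f"2x faster SHA1 -> {overall_speedup(sha1_share, 2.0):.2f}x overall")

# Upper bound: the SHA1 routine costs nothing at all.
print(f"free SHA1      -> {1.0 / (1.0 - sha1_share):.2f}x overall")
```

Even an infinitely fast SHA1 cannot make the whole run more than about twice as fast, because the remaining functions (mostly zlib and memcpy) still account for more than half of the profile.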
How to measure latency
If you enabled the following options when building your kernel:
CONFIG_PERF_COUNTER=y
CONFIG_EVENT_TRACING=y
then on a -tip kernel you can use several new performance counters to measure scheduler latencies.
perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20
The command above shows how much time was spent waiting for a CPU. You can repeat the measurement 10 times:
aldebaran:/home/mingo> perf stat --repeat 10 -e sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244
 Performance counter stats for './hackbench 20' (10 runs):
          59826  sched:sched_stat_wait    #      0.026 M/sec   ( +-   5.540% )
    2280.099643  task-clock-msecs         #      7.525 CPUs    ( +-   1.620% )
    0.303013390  seconds time elapsed   ( +-   3.189% )
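The figures after the "#" signs can be reproduced from the raw numbers. The Python sketch below is a worked check, assuming (this is my reading of the output, not something the source states) that "CPUs" is task-clock time divided by elapsed time and that "M/sec" is the event count per second of task-clock time.

```python
# Values copied from the perf stat output above.
sched_stat_wait_events = 59826
task_clock_msecs = 2280.099643
elapsed_seconds = 0.303013390

task_clock_seconds = task_clock_msecs / 1000.0

# CPU time consumed per second of wall-clock time.
cpus_utilized = task_clock_seconds / elapsed_seconds

# Events per second of CPU time, in millions.
events_per_second_m = sched_stat_wait_events / task_clock_seconds / 1e6

print(f"{cpus_utilized:.3f} CPUs")
print(f"{events_per_second_m:.3f} M/sec")
```

Under these assumptions the results agree with the 7.525 CPUs and 0.026 M/sec shown in the output.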
Reading the scheduling event counters
# perf list 2>&1 | grep sched:
  sched:sched_kthread_stop                 [Tracepoint event]
  sched:sched_kthread_stop_ret             [Tracepoint event]
  sched:sched_wait_task                    [Tracepoint event]
  sched:sched_wakeup                       [Tracepoint event]
  sched:sched_wakeup_new                   [Tracepoint event]
  sched:sched_switch                       [Tracepoint event]
  sched:sched_migrate_task                 [Tracepoint event]
  sched:sched_process_free                 [Tracepoint event]
  sched:sched_process_exit                 [Tracepoint event]
  sched:sched_process_wait                 [Tracepoint event]
  sched:sched_process_fork                 [Tracepoint event]
  sched:sched_signal_send                  [Tracepoint event]
  sched:sched_stat_wait                    [Tracepoint event]
  sched:sched_stat_sleep                   [Tracepoint event]
  sched:sched_stat_iowait                  [Tracepoint event]
For latency analysis, stat_wait/sleep/iowait are the events worth watching. If you want to see all the delays along with their min/max/avg, you can run:
perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace
http://blog.csdn.net/bluebeach/article/details/5912062
Perf stats for "doing nothing"
I've recently discovered the perf Linux tool. I had heard that oprofile was deprecated and that there is a new tool, and I made a note to try it sometime.
Updated: more languages, fixed typos, more details, some graphs. Apologies if this shows twice in your feed.
The problem with perf stats is that I hate bloat, or even perceived bloat. Even when it doesn't affect me in any way, the concept of wasted cycles makes me really sad.
You probably can guess where this is going… I said, well, let's see what perf says about a simple "null" program. Surely doing nothing should be just a small number of instructions, right?
Note: I think that perf also records kernel-side code, because the lowest I could get was about 50K instructions for starting a null program in assembler that doesn't use libc and just executes the syscall asm instruction. However, these ~50K instructions are noise the moment you start to use higher-level languages. Yes, this is expected, but I was still shocked. And there's a lot of delta between languages I'd expected to behave more or less identically.
Again, this is not important in the real world. At all. They are just numbers, and the noise (due to the short runtime) probably has a lot of influence on the resulting numbers. And I might have screwed up the measurements somehow.
Test setup
Each program was the equivalent of 'exit 0' in the appropriate form for the language. During the measurements, the machine was as idle as possible (single-user mode, measurements run at real-time priority, etc.). For compiled languages, -O2 was used. For scripts, a simple #!/path/to/interpreter (without options, except in the case of Python, see below) was used. Each program/script was run 500 times (perf's -r 500) and I've checked that the variations were small (±0.80% on the metrics I used).
You can find all the programs I've used at http://git.k1024.org/perf-null.git/, the current tests are for the tag version perf-null-0.1.
The raw data for the below tables/graphs is at log-4.
Results
Compiled languages
Language            Cycles  Instructions
asm                    63K           51K
c-dietlibc             74K           57K
c-libc-static         177K          107K
c-libc-shared         506K          300K
c++-static            178K          107K
c++-dynamic         1,750K        1,675K
haskell-single      2,229K        1,338K
haskell-threaded    2,629K        1,522K
ocaml-bytecode      3,271K        2,741K
ocaml-native        1,042K          666K
Going from dietlibc to glibc doubles the number of instructions, and for libc going from static to dynamic linking again roughly doubles it. I didn't manage to compile a program dynamically-linked against dietlibc.
C++ is interesting. Linked statically, it is in the same ballpark as C, but when linked dynamically, it executes an order of magnitude more instructions. I would guess that the initialisation of the standard C++ library is complex?
Haskell, which has a GC and quite a complex runtime, executes slightly fewer instructions than C++, but uses more cycles. Not bad, given the capabilities of the runtime. The two versions of the Haskell program are with the single-threaded runtime and with the multi-threaded one; not much difference. A fully statically-linked Haskell binary (not recommended usually) goes below 1M instructions, but not by much.
OCaml is a very nice surprise. The bytecode runtime is a bit slow to start up, but the (native) compiled version is quite fast to start: only 2× the number of instructions and cycles compared to C, for an advanced language. And twice as fast as Haskell ☺. Nice!
Shells
Language            Cycles  Instructions
dash                  766K          469K
bash                1,680K        1,044K
mksh                1,258K          942K
mksh-static           504K          322K
So, dash takes ~470K instructions to start, which is way below the C++ count and a bit higher than the C one. Hence, I'd guess that dash is implemented in C ☺.
Next, bash is indeed slower on startup than dash, and by slightly more than 2× (both instructions and cycles). So yes, switching /bin/sh from bash to dash makes sense.
I wasn't aware of mksh, so thanks for the comments. It is, in the static variant, more efficient than dash, by about 1.5×. However, the dynamically linked version doesn't look too great (dash is also dynamically linked; I would guess a statically-linked dash "beats" mksh-static).
Text processing
I've added perl here (even though it's a 'full' language) just for comparison; it's also in the next section.
Language            Cycles  Instructions
mawk                  849K          514K
gawk                1,363K          980K
perl                2,946K        2,213K
A normal spread. I knew the reason why mawk is Priority: required is that it's faster than gawk, but I wouldn't have guessed it's almost twice as fast.
Interpreted languages
Here is where the fun starts…
Language            Cycles  Instructions
lua 5.1             1,947K        1,485K
lua 5.2             1,724K        1,335K
lua jit             1,209K          803K
perl                2,946K        2,213K
tcl 8.4             5,011K        4,552K
tcl 8.5             6,888K        6,022K
tcl 8.6             8,196K        7,236K
ruby 1.8            7,013K        6,128K
ruby 1.9.3         35,870K       35,022K
python 2.6 -S      11,752K       10,247K
python 2.7 -S      11,438K       10,198K
python 3.2 -S      29,003K       27,409K
pypy -S            21,106K       10,036K
python 2.6         25,143K       21,989K
python 2.7         47,325K       50,217K
python 2.7 -O      47,341K       50,185K
python 3.2        113,567K      124,133K
python 3.2 -O     113,424K      124,133K
pypy               90,779K       68,455K
The numbers here are not quite what I expected. There's a huge delta between the fastest (hi Lua!) and the slowest (bye Python!).
I wasn't familiar with Lua, so I tested it thanks to the comments. It is, I think, the only language which actually improves from one version to the next (bonus points), and where the JIT version also makes it faster. In context, lua jit starts faster than C++.
Perl is the one that goes above C++'s instruction count, but not by much. From the point of view of the system, a Perl 'hello world' is only about 1.3×-1.6× slower than a C++ one. Not bad, not bad.
Next category is composed of TCL and Ruby, both of which had older versions 2-3× slower than Perl, but whose most recent versions are even slower. TCL has an almost constant slowdown across versions (5M, 6.9M, 8.2M cycles), but Ruby seems to have taken a significant step backwards: 1.9.3 is 5× slower than 1.8. I wonder why? As for TCL, I didn't expect it to be slower to start up than Perl; good to know.
Last category is Python. Oh my. If you run perf stat python -c 'pass' you get some unbelievable numbers, like 50M instructions to do, well, nothing. Yes, it has a GC, yes, it does import modules at runtime, but still… On closer investigation, the site module and the imports it does eat a lot of time. Running a simpler python -S brings it back to a more reasonable 10M instructions, which is in line with the other interpreted languages.
However, even with the -S taken into account, Python also slows down across versions: a tiny improvement from 2.6 to 2.7, but (like Ruby) a 3× slowdown from 2.7 to 3.2. Trying the “optimised” version (-O) doesn't help at all. Trying pypy, which was based on Python 2.7, makes it around 2× slower to start up (both with and without -S).
So in the interpreted languages, it seems only Lua is trying to improve; the rest of the languages are piling up bloat with every version. Note: I should have tried multiple perl versions too.
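The effect of the site module can be observed without perf. The following Python sketch is only an illustration, and it assumes a current Python 3 interpreter rather than the 2.6 to 3.2 versions measured above. It starts a fresh interpreter with and without -S and reports whether site was imported at startup and how many modules were loaded. The helper name startup_modules is made up for this example.

```python
import subprocess
import sys

# One-liner run in a fresh interpreter: report whether `site` was imported
# at startup and how many modules are loaded in total.
PROBE = "import sys; print('site' in sys.modules, len(sys.modules))"

def startup_modules(extra_flags):
    """Start a fresh interpreter with the given flags and
    return (site_loaded, module_count)."""
    out = subprocess.run(
        [sys.executable, *extra_flags, "-c", PROBE],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    return out[0] == "True", int(out[1])

default_site, default_count = startup_modules([])
no_site, no_site_count = startup_modules(["-S"])
print(f"default: site loaded={default_site}, modules={default_count}")
print(f"-S:      site loaded={no_site}, modules={no_site_count}")
```

The module counts depend on the Python version and installation, so they are only a rough proxy for the instruction counts in the table.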
Java
Java is in its own category; you guess why ☺, right?
GCJ was version 4.6, whereas by java below I mean OpenJDK Runtime Environment (IcedTea6 1.11) (6b24-1.11-4).
Language            Cycles  Instructions
null-gcj           97,156K       74,576K
java -jamvm        85,535K       80,102K
java -server      147,174K      136,803K
java -zero        132,967K      124,977K
java -cacao       229,799K      205,312K
Using gcj to compile to “native code” (not sure whether that's native-native or something else) results in a binary that uses less than 100M cycles to start, but the jamvm VM is faster than that (85M cycles). Not bad for java! Python 3.2 is slower to start up. Yes, I think the world has gone crazy.
However, the other VMs are a few times slower: server (the default one) is ~150M cycles, and cacao is ~230M cycles. Wow.
The other thing about java is that it was the only one that couldn't be put nicely in a file that you just ‘exec’ (there is binfmt_misc indeed, but that doesn't allow different Java classes to use different Java VMs, so I don't count this), as opposed to every single other thing I tested here. Someone didn't grow up on Unix?
Comparative analysis
Since there are almost 4 orders of magnitude difference between all the things tested here, a graph of cycles or instructions is not really useful. However, cycles/instruction, branch percentage and mispredicted-branch percentage can be. Hence first the cycles/instruction:
Pypy is jumping out of the graph here, with the top value of over 2 cycles/instruction. Lua JIT is also bigger than Lua non-JIT, so maybe there's something to this (mostly joking, two data points don't make a series). On the other hand, Python wins as best cycles/instruction (0.91). Lots of ILP, to get below 1?
Java gets, irrespective of VM, consistently near 1.0-1.1. C++ gets very different numbers between static linking (1.666) and dynamic linking (1.045), whereas C has basically identical numbers. mksh also has a difference between dynamic and static linking. Hmm…
Ruby, TCL and Python have consistent values across versions.
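The ratios quoted above can be approximated from the tables. The Python sketch below uses the rounded cycle and instruction counts (in thousands) printed earlier, so the results can differ in the last digit from values computed on the raw data.

```python
# (cycles, instructions) in thousands, copied from the tables above.
counts = {
    "c++-static": (178, 107),
    "c++-dynamic": (1750, 1675),
    "pypy -S": (21106, 10036),
    "python 3.2": (113567, 124133),
}

def cycles_per_instruction(cycles, instructions):
    """Average number of CPU cycles spent per executed instruction."""
    return cycles / instructions

cpi = {name: cycles_per_instruction(c, i) for name, (c, i) in counts.items()}
for name, value in sorted(cpi.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {value:.2f}")
```

A value below 1 means that, on average, more than one instruction was retired per cycle.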
And that's about what I can see from that graph. Next up, percentage of branches out of total instructions and percentage of branches missed:
Note that the two lines shouldn't really be on the same graph; for the branch %, the 100% is the total instructions count, but for the branch miss %, the 100% is the total branch count. Anyway.
There are two low-value outliers:
- dynamically-linked C++ has a low branch percentage (17.46%) and a very low branch miss percentage (only 4.32%)
- gcj-compiled java has a very low branch miss percentage (only 2.82%!!!), even though it has a “regular” branch percentage (20.85%)
So it seems the gcj libraries are well optimised? I'm not familiar enough with this topic, but on the graph it does indeed stand out.
On the other end, mksh-static has a high branch miss percentage: 11.60%, which jumps clearly ahead of all the others; this might be why it has a high cycles/instruction count, due to all the stalls in misprediction; one has to wonder why it confuses the branch predictor?
I find it interesting that the overall branch count is very similar across languages, both when most of the cost is in the kernel (e.g. asm) and when the user-space cost heavily outweighs the kernel (e.g. Java). The average is 20.85%, minimum is 17.46%, max 22.93%, standard deviation (if I used gnumeric correctly) is just 0.01. This seems a bit suspicious to me ☺. On the other hand, the mispredicted branches percentage varies much more: from a measly 2.82% to 11.60% (roughly a 4× difference).
Summary
So to recap, counting just instructions:
- going from dietlibc to glibc: 2× increase
- going from statically-linked libc to dynamically-linked libc: doubles it again
- going from C to C++: 5× increase
- C++ to Perl: 1.3×
- Perl to Ruby: 3×
- Ruby to Python (-S): 1.6×
- Python -S to regular Python: 5×
- Python to Java: 1×-2×, depending on version/runtime
- branch percentage (per total instructions) is quite consistent across all of the programs
Overall, you get roughly three orders of magnitude slower startup between a plain C program using dietlibc and Python. And all, to do basically nothing.
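The recap can be checked against the instruction counts in the tables. The Python sketch below assumes that "Ruby" and "Python" in the list refer to the ruby 1.8 and python 2.7 rows; that choice is my assumption, since the summary does not name versions.

```python
# Instruction counts in thousands, copied from the tables above.
steps = [
    ("c-dietlibc", 57),
    ("c-libc-static", 107),
    ("c-libc-shared", 300),
    ("c++-dynamic", 1675),
    ("perl", 2213),
    ("ruby 1.8", 6128),
    ("python 2.7 -S", 10198),
    ("python 2.7", 50217),
]

def step_ratios(steps):
    """Ratio of each entry's instruction count to the previous entry's."""
    return [
        (prev_name, name, count / prev_count)
        for (prev_name, prev_count), (name, count) in zip(steps, steps[1:])
    ]

ratios = step_ratios(steps)
for prev_name, name, ratio in ratios:
    print(f"{prev_name} -> {name}: {ratio:.1f}x")

# End-to-end factor between the dietlibc C program and regular Python 2.7.
overall = steps[-1][1] / steps[0][1]
print(f"overall: {overall:.0f}x")
```

With these rows, the end-to-end factor comes out just under 1000, which is consistent with the "roughly three orders of magnitude" above.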
On the other hand, I learned some interesting things while doing it, so it wasn't quite for nothing ☺.