[Benchmarking] ---- lmbench
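Introduction
A system's performance can be judged by many different metrics, each with its own test methods and tools. To keep results fair and authoritative, mature commercial test software is usually chosen. But when the goal is only a quick comparison between systems, or between library implementations, good open-source tools can do the job as well. This article uses lmbench to give a brief introduction to testing overall system performance.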
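Test software
lmbench is a suite of simple, portable microbenchmarks for UNIX/POSIX systems, written in ANSI C. Broadly, it measures two key characteristics: latency and bandwidth. It is intended to give system developers insight into the basic cost of key operations.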
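To make the two metrics concrete, here is a minimal Python sketch that times a trivial system call (latency) and a large in-memory copy (bandwidth). It is not lmbench code: the function names, iteration counts and buffer size are illustrative assumptions, and interpreter overhead dominates the latency figure, so the numbers are not comparable with lmbench's C results.

```python
import os
import time

def null_call_latency_us(iterations=100_000):
    """Average time of one os.getppid() call, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getppid()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6

def copy_bandwidth_mb_s(size_mb=16, repeats=5):
    """Best observed bandwidth when copying a size_mb buffer, in MB/s."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)                  # one full copy of the buffer
        best = min(best, time.perf_counter() - start)
    return size_mb / best

if __name__ == "__main__":
    print(f"null call: {null_call_latency_us():.3f} us")
    print(f"memory copy: {copy_bandwidth_mb_s():.0f} MB/s")
```

Latency is reported as time per operation (smaller is better), bandwidth as data moved per unit time (bigger is better), which is the same convention the lmbench summary tables use.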
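About the software:
lmbench is a multi-platform, open-source benchmark for evaluating overall system performance. It can measure file reads and writes, memory operations, process creation and teardown overhead, networking, and more, and it is simple to run.
Because lmbench runs on many platforms, it can compare systems of the same class and show where each is strong or weak; by linking against different libraries, it can also compare the performance of library functions. More importantly, as open-source software lmbench provides a test framework: a tester who needs more from a test item can get it with small changes to the source. For example, lmbench currently measures only process creation and termination and the cost of process switching, but modifying part of the code is enough to obtain thread-level measurements.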
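As a rough illustration of what a thread-level counterpart to the process-creation test could look like, the sketch below times fork-and-exit against starting and joining an empty thread. It is a Python stand-in, not a patch to lmbench; both function names and the iteration counts are hypothetical, and `os.fork` restricts it to POSIX systems.

```python
import os
import threading
import time

def fork_exit_us(iterations=200):
    """Average cost of fork() + immediate child exit + wait, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)               # child: exit at once, no cleanup handlers
        os.waitpid(pid, 0)            # parent: reap the child
    return (time.perf_counter() - start) / iterations * 1e6

def thread_start_join_us(iterations=200):
    """Average cost of starting and joining a thread that does nothing."""
    start = time.perf_counter()
    for _ in range(iterations):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / iterations * 1e6

if __name__ == "__main__":
    print(f"fork+exit:         {fork_exit_us():.1f} us")
    print(f"thread start+join: {thread_start_join_us():.1f} us")
```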
Download:
www.bitmover.com/lmbench (latest version: 3.0-a9)
Link to the lmbench official site
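Main functions of LMbench:
Bandwidth benchmarks
- Cached file read
- Memory copy
- Memory read
- Memory write
- Pipe
- TCP
Latency benchmarks
- Context switching
- Networking: connection establishment, pipe, TCP, UDP, and RPC hot potato
- File system creates and deletes
- Process creation
- Signal handling
- System call overhead
- Memory read latency
Miscellaneous
- Processor clock rate calculation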
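Main features of LMbench:
- Portability test for operating systems
The benchmarks are written in C and are reasonably portable (although they build most easily with GCC). This makes them useful for producing like-for-like comparisons between systems.
- Self-adjusting
lmbench adapts to the system under test. If BloatOS turns out to be 4 times slower than all its competitors, the tool reallocates its resources to compensate.
- Database of results
The results database includes runs from most major computer workstation manufacturers.
- Memory latency results
The memory latency test shows the latency of every (data) cache in the system, such as level 1, 2 and 3, as well as main memory and TLB miss latency. In addition, the cache sizes can be read off a properly plotted set of results, which hardware engineers appreciate. This benchmark has uncovered bugs in operating system paging policies.
- Context switching results
Context switch numbers are popular, but this benchmark does not settle for quoting only the "in cache" figure. It varies both the number and the size of the processes, and plots the results so that the user can see when the context no longer fits in cache. It also gives the actual cost of cold-cache context switches.
- Regression testing
Sun and SGI have used these benchmarks to find and fix performance problems.
Intel used them during the development of the P6.
Linux has used them in its performance tuning.
- New benchmarks
The source code is small, readable, and easy to extend. It can readily be recombined to test other things, for example networking measurements of library code that handles connection establishment, server shutdown, and so on.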
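The memory latency test in the list above relies on dependent loads ("pointer chasing"): each load's address comes from the previous load, so the hardware cannot overlap them. The sketch below shows the idea in Python under stated assumptions: the function names are hypothetical, the random single-cycle layout (Sattolo's algorithm) is one common way to defeat prefetching rather than lmbench's own layout, and interpreter overhead swamps real cache effects, so only the structure carries over.

```python
import random
import time

def random_cycle(n, seed=0):
    """Return a list next_index where following next_index visits all n
    slots exactly once before returning to the start (Sattolo's algorithm)."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # j < i guarantees a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def chase_ns_per_load(next_index, loads=200_000):
    """Average time per dependent load, in nanoseconds."""
    p = 0
    start = time.perf_counter()
    for _ in range(loads):
        p = next_index[p]             # each load depends on the previous one
    elapsed = time.perf_counter() - start
    return elapsed / loads * 1e9

if __name__ == "__main__":
    for n in (1 << 10, 1 << 16, 1 << 20):
        print(n, f"{chase_ns_per_load(random_cycle(n)):.1f} ns/load")
```

In a compiled language, repeating the chase over growing array sizes produces the stepped latency curve from which cache sizes can be read.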
Directory layout
[root@jiangyi01.sqa.zmf /tmp/lmbench3]
#ls
lmbench3 lmbench3.tar.gz
[root@jiangyi01.sqa.zmf /tmp/lmbench3]
#cd lmbench3/
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#ls
ACKNOWLEDGEMENTS CHANGES COPYING-2 hbench-REBUTTAL README SCCS src
bin COPYING doc Makefile results scripts
Configuration files
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#ll bin/x86_64-linux-gnu/*`hostname`
-rw-r--r-- 1 root root 719 Mar 8 17:18 bin/x86_64-linux-gnu/CONFIG.jiangyi01.sqa.zmf
-rwxr-xr-x 1 root root 1232 Mar 7 20:52 bin/x86_64-linux-gnu/INFO.jiangyi01.sqa.zmf
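The CONFIG.<hostname> file listed above stores the answers given during configuration. Assuming it consists of shell-style KEY=VALUE lines (an assumption about the file format; the keys in the sample below are illustrative, not copied from this host's file), a small parser such as the following hypothetical helper makes the settings easy to inspect or compare between runs.

```python
import shlex

def parse_lmbench_config(text):
    """Parse shell-style KEY=VALUE lines into a dict of strings."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)      # strips shell quoting
        config[key] = parts[0] if parts else ""
    return config

# Hypothetical excerpt; the values echo answers from the run in this article.
SAMPLE = '''\
MB=100
FASTMEM="NO"
FSDIR=/usr/tmp
MHZ="2194 MHz, 0.4558 nanosec clock"
DISKS=""
'''

if __name__ == "__main__":
    for key, value in parse_lmbench_config(SAMPLE).items():
        print(f"{key:8s} {value!r}")
```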
Script that generates the configuration file
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#ll scripts/config-run
-r-xr-xr-x 1 14557 501 21018 Mar 8 17:18 scripts/config-run
Generating the configuration file
The make results command actually invokes scripts/config-run.
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#make results
cd src && make results
make[1]: Entering directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Entering directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Nothing to be done for `all'.
gmake[2]: Leaving directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Entering directory `/tmp/lmbench3/lmbench3/src'
gmake[2]: Nothing to be done for `opt'.
gmake[2]: Leaving directory `/tmp/lmbench3/lmbench3/src'
=====================================================================
L M B E N C H C ON F I G U R A T I O N
----------------------------------------
You need to configure some parameters to lmbench. Once you have configured
these parameters, you may do multiple runs by saying
"make rerun"
in the src subdirectory.
NOTICE: please do not have any other activity on the system if you can
help it. Things like the second hand on your xclock or X perfmeters
are not so good when benchmarking. In fact, X is not so good when
benchmarking.
=====================================================================
Hang on, we are calculating your timing granularity.
OK, it looks like you can time stuff down to 5000 usec resolution.
Hang on, we are calculating your timing overhead.
OK, it looks like your gettimeofday() costs 0 usecs.
Hang on, we are calculating your loop overhead.
OK, it looks like your benchmark loop costs 0.00000197 usecs.
=====================================================================
If you are running on an MP machine and you want to try running
multiple copies of lmbench in parallel, you can specify how many here.
Using this option will make the benchmark run 100x slower (sorry).
NOTE: WARNING! This feature is experimental and many results are
known to be incorrect or random!
MULTIPLE COPIES [default 1] 1
Options to control job placement
1) Allow scheduler to place jobs
2) Assign each benchmark process with any attendent child processes
to its own processor
3) Assign each benchmark process with any attendent child processes
to its own processor, except that it will be as far as possible
from other processes
4) Assign each benchmark and attendent processes to their own
processors
5) Assign each benchmark and attendent processes to their own
processors, except that they will be as far as possible from
each other and other processes
6) Custom placement: you assign each benchmark process with attendent
child processes to processors
7) Custom placement: you assign each benchmark and attendent
processes to processors
Note: some benchmarks, such as bw_pipe, create attendent child
processes for each benchmark process. For example, bw_pipe
needs a second process to send data down the pipe to be read
by the benchmark process. If you have three copies of the
benchmark process running, then you actually have six processes;
three attendent child processes sending data down the pipes and
three benchmark processes reading data and doing the measurements.
Job placement selection: 1
=====================================================================
Several benchmarks operate on a range of memory. This memory should be
sized such that it is at least 4 times as big as the external cache[s]
on your system. It should be no more than 80% of your physical memory.
The bigger the range, the more accurate the results, but larger sizes
take somewhat longer to run the benchmark.
MB [default 67535] 100
Checking to see if you have 100 MB; please wait for a moment...
100MB OK
100MB OK
100MB OK
Hang on, we are calculating your cache line size.
OK, it looks like your cache line is 128 bytes.
=====================================================================
lmbench measures a wide variety of system performance, and the full suite
of benchmarks can take a long time on some platforms. Consequently, we
offer the capability to run only predefined subsets of benchmarks, one
for operating system specific benchmarks and one for hardware specific
benchmarks. We also offer the option of running only selected benchmarks
which is useful during operating system development.
Please remember that if you intend to publish the results you either need
to do a full run or one of the predefined OS or hardware subsets.
SUBSET (ALL|HARWARE|OS|DEVELOPMENT) [default all] h
=====================================================================
This benchmark measures, by default, memory latency for a number of
different strides. That can take a long time and is most useful if you
are trying to figure out your cache line size or if your cache line size
is greater than 128 bytes.
If you are planning on sending in these results, please don't do a fast
run.
Answering yes means that we measure memory latency with a 128 byte stride.
FASTMEM [default no]
=====================================================================
This benchmark measures, by default, file system latency. That can
take a long time on systems with old style file systems (i.e., UFS,
FFS, etc.). Linux' ext2fs and Sun's tmpfs are fast enough that this
test is not painful.
If you are planning on sending in these results, please don't do a fast
run.
If you want to skip the file system latency tests, answer "yes" below.
SLOWFS [default no]
=====================================================================
This benchmark can measure disk zone bandwidths and seek times. These can
be turned into whizzy graphs that pretty much tell you everything you might
need to know about the performance of your disk.
This takes a while and requires read access to a disk drive.
Write is not measured, see disk.c to see how if you want to do so.
If you want to skip the disk tests, hit return below.
If you want to include disk tests, then specify the path to the disk
device, such as /dev/sda. For each disk that is readable, you'll be
prompted for a one line description of the drive, i.e.,
Iomega IDE ZIP
or
HP C3725S 2GB on 10MB/sec NCR SCSI bus
DISKS [default none]
=====================================================================
If you are running on an idle network and there are other, identically
configured systems, on the same wire (no gateway between you and them),
and you have rsh access to them, then you should run the network part
of the benchmarks to them. Please specify any such systems as a space
separated list such as: ether-host fddi-host hippi-host.
REMOTE [default none]
=====================================================================
Calculating mhz, please wait for a moment...
I think your CPU mhz is
2194 MHz, 0.4558 nanosec clock
but I am frequently wrong. If that is the wrong Mhz, type in your
best guess as to your processor speed. It doesn't have to be exact,
but if you know it is around 800, say 800.
Please note that some processors, such as the P4, have a core which
is double-clocked, so on those processors the reported clock speed
will be roughly double the advertised clock rate. For example, a
1.8GHz P4 may be reported as a 3592MHz processor.
Processor mhz [default 2194 MHz, 0.4558 nanosec clock]
=====================================================================
We need a place to store a 100 Mbyte file as well as create and delete a
large number of small files. We default to /usr/tmp. If /usr/tmp is a
memory resident file system (i.e., tmpfs), pick a different place.
Please specify a directory that has enough space and is a local file
system.
FSDIR [default /usr/tmp]
=====================================================================
lmbench outputs status information as it runs various benchmarks.
By default this output is sent to /dev/tty, but you may redirect
it to any file you wish (such as /dev/null...).
Status output file [default /dev/tty]
=====================================================================
There is a database of benchmark results that is shipped with new
releases of lmbench. Your results can be included in the database
if you wish. The more results the better, especially if they include
remote networking. If your results are interesting, i.e., for a new
fast box, they may be made available on the lmbench web page, which is
http://www.bitmover.com/lmbench
Mail results [default yes] n
OK, no results mailed.
=====================================================================
Confguration done, thanks.
There is a mailing list for discussing lmbench hosted at BitMover.
Send mail to majordomo@bitmover.com to join the list.
Using config in CONFIG.jiangyi01.sqa.zmf
Wed Mar 8 16:30:53 CST 2017
Latency measurements
Wed Mar 8 16:31:10 CST 2017
Local networking
Wed Mar 8 16:31:14 CST 2017
Bandwidth measurements
Wed Mar 8 16:31:27 CST 2017
Calculating effective TLB size
Wed Mar 8 16:31:29 CST 2017
Calculating memory load parallelism
Wed Mar 8 16:32:12 CST 2017
McCalpin's STREAM benchmark
Wed Mar 8 16:32:14 CST 2017
Calculating memory load latency
Wed Mar 8 16:52:16 CST 2017
make[1]: Leaving directory `/tmp/lmbench3/lmbench3/src'
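The timestamps in the status output show where the run spends its time. The sketch below computes per-phase durations from them, assuming each phase runs from the timestamp printed just before its label to the next timestamp (an inference from the output order, not something lmbench documents here). The times are copied from the transcript; the helper name is hypothetical.

```python
from datetime import datetime

# (timestamp, phase label) pairs in output order; the last entry is the
# closing timestamp and has no phase of its own.
LOG = [
    ("16:30:53", "Latency measurements"),
    ("16:31:10", "Local networking"),
    ("16:31:14", "Bandwidth measurements"),
    ("16:31:27", "Calculating effective TLB size"),
    ("16:31:29", "Calculating memory load parallelism"),
    ("16:32:12", "McCalpin's STREAM benchmark"),
    ("16:32:14", "Calculating memory load latency"),
    ("16:52:16", None),
]

def phase_durations(log):
    """Seconds spent in each phase, assuming the run stays within one day."""
    fmt = "%H:%M:%S"
    durations = {}
    for (t0, name), (t1, _) in zip(log, log[1:]):
        delta = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
        durations[name] = int(delta.total_seconds())
    return durations

if __name__ == "__main__":
    for name, seconds in phase_durations(LOG).items():
        print(f"{seconds:5d} s  {name}")
```

Under that assumption the memory load latency phase accounts for 1202 of the run's 1283 seconds, which matches the configuration prompt's warning that measuring many strides takes a long time unless FASTMEM is enabled.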
Reading the results
[root@jiangyi01.sqa.zmf /tmp/lmbench3/lmbench3]
#make see
cd results && make summary percent 2>/dev/null | more
make[1]: Entering directory `/tmp/lmbench3/lmbench3/results'
L M B E N C H 3 . 0 S U M M A R Y
------------------------------------
(Alpha software, do not distribute)
Basic system parameters
------------------------------------------------------------------------------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
jiangyi01 Linux 3.10.0- x86_64-linux-gnu 2194 32 128 6.4300 1
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
jiangyi01 Linux 3.10.0- 2195 0.07 0.15 0.99 1.96 4.12 0.16 1.05
Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host OS intgr intgr intgr intgr intgr
bit add mul div mod
--------- ------------- ------ ------ ------ ------ ------
jiangyi01 Linux 3.10.0- 0.4600 0.0700 1.4100 10.3 11.5
Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host OS float float float float
add mul div bogo
--------- ------------- ------ ------ ------ ------
jiangyi01 Linux 3.10.0- 1.3700 2.2800 6.5400 6.3900
Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host OS double double double double
add mul div bogo
--------- ------------- ------ ------ ------ ------
jiangyi01 Linux 3.10.0- 1.3700 2.2800 10.2 10.0
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
jiangyi01 Linux 3.10.0- 0.309 1.549
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
jiangyi01 Linux 3.10.0- 2405.2 4445.5 6114 5489.
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Rand mem Guesses
--------- ------------- --- ---- ---- -------- -------- -------
jiangyi01 Linux 3.10.0- 2194 1.8250 5.4860 49.3 117.6
make[1]: Leaving directory `/tmp/lmbench3/lmbench3/results'
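Latencies in nanoseconds are easier to compare across machines once converted to clock cycles. A minimal sketch using the values from the summary above (the helper names are illustrative):

```python
def clock_period_ns(mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / mhz

def ns_to_cycles(ns, mhz):
    """Convert a latency in nanoseconds to CPU clock cycles."""
    return ns * mhz / 1000.0

MHZ = 2194  # "Mhz" column of the summary
LATENCIES_NS = {
    "L1 $": 1.8250,
    "L2 $": 5.4860,
    "Main mem": 49.3,
    "Rand mem": 117.6,
}

print(f"clock period: {clock_period_ns(MHZ):.4f} ns")
for name, ns in LATENCIES_NS.items():
    print(f"{name:9s} {ns:8.4f} ns  ~{ns_to_cycles(ns, MHZ):6.1f} cycles")
```

The clock period comes out at 0.4558 ns, matching the figure lmbench printed during configuration, and the L1 and L2 latencies correspond to roughly 4 and 12 cycles.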
Technical parameters
Parameter descriptions
The descriptions of the result fields below are incomplete; see the REF link for a fuller list.
(1) Basic system parameters
Tlb pages: number of pages covered by the TLB (Translation Lookaside Buffer)
Cache line bytes: cache line size in bytes
Mem par: memory hierarchy parallelism
Scal load: number of lmbench copies run in parallel
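(2) Processor, Processes (processor and process operation times)
Null call: a trivial system call (getting the process ID)
Null I/O: trivial I/O (the average of a null read and a null write)
Stat: getting a file's status
Open clos: opening a file and immediately closing it
Slct TCP: select() on TCP descriptors
Sig inst: installing a signal handler
Sig hndl: catching and handling a signal
Fork proc: fork a process that exits immediately
Exec proc: fork, then call execve, then exit
Sh proc: fork, then run a shell, then exit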
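To show what columns such as Null I/O, Stat and Open clos measure, the sketch below times rough Python equivalents. It assumes a Linux-like system with /dev/zero and /dev/null; the helper names and the default path are hypothetical, and the results include interpreter overhead that lmbench's C code does not have.

```python
import os
import time

def time_op_us(op, iterations=10_000):
    """Average wall-clock time of op(), in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        op()
    return (time.perf_counter() - start) / iterations * 1e6

def measure_basic_calls(path="/dev/null", iterations=10_000):
    """Time rough equivalents of the Null I/O, Stat and Open clos columns."""
    zero = os.open("/dev/zero", os.O_RDONLY)
    null = os.open("/dev/null", os.O_WRONLY)
    try:
        read_us = time_op_us(lambda: os.read(zero, 1), iterations)
        write_us = time_op_us(lambda: os.write(null, b"x"), iterations)
    finally:
        os.close(zero)
        os.close(null)
    return {
        "null I/O": (read_us + write_us) / 2,   # mean of 1-byte read and write
        "stat": time_op_us(lambda: os.stat(path), iterations),
        "open clos": time_op_us(
            lambda: os.close(os.open(path, os.O_RDONLY)), iterations),
    }

if __name__ == "__main__":
    for name, us in measure_basic_calls().items():
        print(f"{name:10s} {us:.3f} us")
```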
(3) Basic integer/float/double operations
Omitted.
(4) Context switching (context switch times)
2p/16K: two processes switching back and forth, each with a 16K working set
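A 2p/16K style measurement can be sketched as two processes passing a one-byte token through a pair of pipes, each scanning a 16K buffer on every turn. The code below is a simplified Python illustration with a hypothetical function name: it reports the raw round-trip time and, unlike a real context switch benchmark, does not subtract pipe overhead. On a multi-core machine the two processes may also run on different CPUs, in which case the figure reflects wake-up latency more than a true context switch.

```python
import os
import time

def pingpong_us(round_trips=2000, working_set=16 * 1024):
    """Average time of one token round trip between two processes, in
    microseconds. Each process scans its own working_set bytes per turn."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: echo every token back until the parent closes its end.
        try:
            os.close(to_child_w)
            os.close(to_parent_r)
            data = bytearray(working_set)
            while os.read(to_child_r, 1):
                data.count(b"\x01")       # scan the working set
                os.write(to_parent_w, b"x")
        finally:
            os._exit(0)
    os.close(to_child_r)
    os.close(to_parent_w)
    data = bytearray(working_set)
    start = time.perf_counter()
    for _ in range(round_trips):
        data.count(b"\x01")               # scan the working set
        os.write(to_child_w, b"x")        # hand the token to the child
        os.read(to_parent_r, 1)           # block until it comes back
    elapsed = time.perf_counter() - start
    os.close(to_child_w)                  # EOF lets the child leave its loop
    os.close(to_parent_r)
    os.waitpid(pid, 0)
    return elapsed / round_trips * 1e6

if __name__ == "__main__":
    print(f"round trip: {pingpong_us():.1f} us")
```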
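(5) Local Communication latencies (each message is sent over the given mechanism and read back immediately on the same host)
Pipe: pipe communication
AF UNIX: UNIX domain sockets
UDP: UDP
RPC/UDP: RPC over UDP
TCP: TCP
RPC/TCP: RPC over TCP
TCP conn: establishing a TCP connection and closing the descriptor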
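(6) File & VM system latencies (file and virtual memory latencies)
File Create & Delete: creating and deleting a file
Mmap Latency: mapping a file into memory
Prot Fault: protection fault
Page Fault: page fault
100fd selct: time for select() on 100 file descriptors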
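(7) Local Communication bandwidths
Pipe: pipe
AF UNIX: UNIX domain sockets
TCP: TCP
File reread: rereading a file
Mmap reread: rereading a memory-mapped file
Bcopy (libc): memory copy using the libc routine
Bcopy (hand): memory copy using a hand-written loop
Mem read: memory read
Mem write: memory write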
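(8) Memory latencies
L1 $: level 1 cache
L2 $: level 2 cache
Main mem: main memory, sequential access
Rand mem: main memory, random access
Guesses:
If L1 and L2 are about the same, this shows "No L1 cache?"
If L2 and Main mem are about the same, this shows "No L2 cache?"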
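The Guesses rule can be written as a small function. The sketch below is an illustration only: the 10% tolerance used to decide that two latencies are "about the same" is an assumption, not the threshold lmbench's summary script uses, and the function name is hypothetical.

```python
def cache_guess(l1_ns, l2_ns, main_ns, tolerance=0.10):
    """Return the hint the Guesses column would carry, or "" if none.

    tolerance is an assumed relative difference below which two latencies
    count as similar; it is not taken from lmbench.
    """
    def similar(a, b):
        return abs(a - b) <= tolerance * max(a, b)

    if similar(l1_ns, l2_ns):
        return "No L1 cache?"
    if similar(l2_ns, main_ns):
        return "No L2 cache?"
    return ""

if __name__ == "__main__":
    # Values from the summary in this article: the three levels are clearly
    # distinct, so no hint is produced.
    print(repr(cache_guess(1.8250, 5.4860, 49.3)))
```

For the run shown in this article the function returns an empty string, consistent with the empty Guesses column in the summary.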