blktrace分析IO

前言

由于在iostat输出中，只能看到service time + wait time，因为对于评估一个磁盘或者云磁盘而言，service time才是衡量磁盘性能的核心指标和直接指标。很不幸iostat无法提供这个指标，但是blktrace可以。

blktrace是一柄神器，很多工具都是基于该神器的：ioprof，seekwatcher，iowatcher，这个工具基本可以满足我们的对块设备请求的所有了解。

blktrace的原理

一个I/O请求，从应用层到底层块设备，路径如下图所示：

从上图可以看出IO路径是很复杂的。这么复杂的IO路径我们是无法用短短一篇小博文介绍清楚的。我们将IO路径简化一下：

一个I/O请求进入block layer之后，可能会经历下面的过程：

Remap: 可能被DM(Device Mapper)或MD(Multiple Device, Software RAID) remap到其它设备
Split: 可能会因为I/O请求与扇区边界未对齐、或者size太大而被分拆(split)成多个物理I/O
Merge: 可能会因为与其它I/O请求的物理位置相邻而合并(merge)成一个I/O
被IO Scheduler依照调度策略发送给driver
被driver提交给硬件，经过HBA、电缆（光纤、网线等）、交换机（SAN或网络）、最后到达存储设备，设备完成IO请求之后再把结果发回。

blktrace 能够记录下IO所经历的各个步骤:

我们一起看下blktrace的输出长什么样子：

第一个字段：8,0 这个字段是设备号 major device ID和minor device ID。
第二个字段：3 表示CPU
第三个字段：11 序列号
第四个字段：0.009507758 Time Stamp是时间偏移
第五个字段：PID 本次IO对应的进程ID
第六个字段：Event，这个字段非常重要，反映了IO进行到了那一步
第七个字段：R表示 Read， W是Write，D表示block，B表示Barrier Operation
第八个字段：223490+56，表示的是起始block number 和 number of blocks，即我们常说的Offset 和 Size
第九个字段：进程名

其中第六个字段非常有用：每一个字母都代表了IO请求所经历的某个阶段。

Q – 即将生成IO请求
|
G – IO请求生成
|
I – IO请求进入IO Scheduler队列
|
D – IO请求进入driver
|
C – IO请求执行完毕

根据以上步骤对应的时间戳就可以计算出I/O请求在每个阶段所消耗的时间：

Q2G – 生成IO请求所消耗的时间，包括remap和split的时间；
G2I – IO请求进入IO Scheduler所消耗的时间，包括merge的时间；
I2D – IO请求在IO Scheduler中等待的时间；
D2C – IO请求在driver和硬件上所消耗的时间；
Q2C – 整个IO请求所消耗的时间(Q2I + I2D + D2C = Q2C)，相当于iostat的await。

如果I/O性能慢的话，以上指标有助于进一步定位缓慢发生的地方：
D2C可以作为硬件性能的指标；
I2D可以作为IO Scheduler性能的指标。

一个IO的生命周期大约是：

● I/O enters block layer – it can be:

– Remapped onto another device (MD, DM)

– Split into 2 separate I/Os (alignment, size, ...)

– Added to the request queue

– Merged with a previous entry on the queue All I/Os end up on a request queue at some point

● At some later time, the I/O is issued to a device driver, and submitted to a device

● Later, the I/O is completed by the device, and its driver

Q2Q — time between requests sent to the block layer
Q2G — time from a block I/O is queued to the time it gets a request allocated for it
G2I — time from a request is allocated to the time it is Inserted into the device's queue
Q2M — time from a block I/O is queued to the time it gets merged with an existing request
I2D — time from a request is inserted into the device's queue to the time it is actually issued to the device
M2D — time from a block I/O is merged with an exiting request until the request is issued to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer for a request

注意，整个IO路径，分成很多段，每一段开始的时候，都会有一个时间戳，根据上一段开始的时间和下一段开始的时间，就可以得到IO 路径各段花费的时间。

注意，我们心心念念的service time，也就是反应块设备处理能力的指标，就是从D到C所花费的时间，简称D2C。

而iostat输出中的await，即整个IO从生成请求到IO请求执行完毕，即从Q到C所花费的时间，我们简称Q2C。

我们知道Linux 有I/O scheduler，调度器的效率如何，I2D是重要的指标。

blktrace、blkparse和btt

我们接下来简单介绍这些工具的使用，其中这三个命令都是属于blktrace这个包的，他们是一家人。

使用blktrace需要挂载debugfs：

$ mount -t debugfs debugfs /sys/kernel/debug

首先通过如下命令，可以查看磁盘上的实时信息：

blktrace -d /dev/sdb -o – | blkparse -i –

这个命令会连绵不绝地出现很多输出，当你输入ctrl＋C的时候，会停止。

当然了，你也可以先用如下命令采集信息，待所有信息采集完毕后，统一分析所有采集到的数据。搜集信息的命令如下：

blktrace -d /dev/sdb

注意，这个命令并不是只输出一个文件，他会根据CPU的个数上，每一个CPU都会输出一个文件，你也可以用-o参数指定自己的输出文件名。如下所示：

-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.0
-rw-r--r-- 1 manu manu  823K Jul  6 19:58 sdb.blktrace.1
-rw-r--r-- 1 manu manu  2.8M Jul  6 19:58 sdb.blktrace.10
-rw-r--r-- 1 manu manu  1.9M Jul  6 19:58 sdb.blktrace.11
-rw-r--r-- 1 manu manu  474K Jul  6 19:58 sdb.blktrace.12
-rw-r--r-- 1 manu manu  271K Jul  6 19:58 sdb.blktrace.13
-rw-r--r-- 1 manu manu  578K Jul  6 19:58 sdb.blktrace.14
-rw-r--r-- 1 manu manu  375K Jul  6 19:58 sdb.blktrace.15
-rw-r--r-- 1 manu manu  382K Jul  6 19:58 sdb.blktrace.16
-rw-r--r-- 1 manu manu  478K Jul  6 19:58 sdb.blktrace.17
-rw-r--r-- 1 manu manu  839K Jul  6 19:58 sdb.blktrace.18
-rw-r--r-- 1 manu manu  848K Jul  6 19:58 sdb.blktrace.19
-rw-r--r-- 1 manu manu  1.6M Jul  6 19:58 sdb.blktrace.2
-rw-r--r-- 1 manu manu  652K Jul  6 19:58 sdb.blktrace.20
-rw-r--r-- 1 manu manu  738K Jul  6 19:58 sdb.blktrace.21
-rw-r--r-- 1 manu manu  594K Jul  6 19:58 sdb.blktrace.22
-rw-r--r-- 1 manu manu  527K Jul  6 19:58 sdb.blktrace.23
-rw-r--r-- 1 manu manu 1005K Jul  6 19:58 sdb.blktrace.3
-rw-r--r-- 1 manu manu  1.2M Jul  6 19:58 sdb.blktrace.4
-rw-r--r-- 1 manu manu  511K Jul  6 19:58 sdb.blktrace.5
-rw-r--r-- 1 manu manu  2.3M Jul  6 19:58 sdb.blktrace.6
-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.7
-rw-r--r-- 1 manu manu  2.1M Jul  6 19:58 sdb.blktrace.8
-rw-r--r-- 1 manu manu  1.1M Jul  6 19:58 sdb.blktrace.9

有了输出，我们可以通过blkparse -i sdb来分析采集的数据：

$ blktrace -d /dev/sdb
$ blkparse -i sdb
  # 第一个IO开始：
  8,16   1        1     0.000000000 18166  Q   R 0 + 1 [dd]
  8,16   1        0     0.000009827     0  m   N cfq18166S  /user.slice alloced
  8,16   1        2     0.000010451 18166  G   R 0 + 1 [dd]
  8,16   1        3     0.000011056 18166  P   N [dd]
  8,16   1        4     0.000012255 18166  I   R 0 + 1 [dd]
  8,16   1        0     0.000013477     0  m   N cfq18166S  /user.slice insert_request
  8,16   1        0     0.000014526     0  m   N cfq18166S  /user.slice add_to_rr
  8,16   1        5     0.000016643 18166  U   N [dd] 1
  8,16   1        0     0.000017522     0  m   N cfq workload slice:300
  8,16   1        0     0.000018880     0  m   N cfq18166S  /user.slice set_active wl_class:0 wl_type:2
  8,16   1        0     0.000020594     0  m   N cfq18166S  /user.slice fifo=          (null)
  8,16   1        0     0.000021462     0  m   N cfq18166S  /user.slice dispatch_insert
  8,16   1        0     0.000022898     0  m   N cfq18166S  /user.slice dispatched a request
  8,16   1        0     0.000023786     0  m   N cfq18166S  /user.slice activate rq, drv=1
  8,16   1        6     0.000023977 18166  D   R 0 + 1 [dd]
  8,16   0        1     0.014270153     0  C   R 0 + 1 [0]
  # 第一个IO结束。
  8,16   0        0     0.014278115     0  m   N cfq18166S  /user.slice complete rqnoidle 0
  8,16   0        0     0.014280044     0  m   N cfq18166S  /user.slice set_slice=100
  8,16   0        0     0.014282217     0  m   N cfq18166S  /user.slice arm_idle: 8 group_idle: 0
  8,16   0        0     0.014282728     0  m   N cfq schedule dispatch
  # 第二个IO开始：
  8,16   1        7     0.014298247 18166  Q   R 1 + 1 [dd]
  8,16   1        8     0.014300522 18166  G   R 1 + 1 [dd]
  8,16   1        9     0.014300984 18166  P   N [dd]
  8,16   1       10     0.014301996 18166  I   R 1 + 1 [dd]
  8,16   1        0     0.014303864     0  m   N cfq18166S  /user.slice insert_request
  8,16   1       11     0.014304981 18166  U   N [dd] 1
  8,16   1        0     0.014306368     0  m   N cfq18166S  /user.slice dispatch_insert
  8,16   1        0     0.014307793     0  m   N cfq18166S  /user.slice dispatched a request
  8,16   1        0     0.014308763     0  m   N cfq18166S  /user.slice activate rq, drv=1
  8,16   1       12     0.014308962 18166  D   R 1 + 1 [dd]
  8,16   0        2     0.014518615     0  C   R 1 + 1 [0]
  # 第二个IO结束。
  8,16   0        0     0.014523548     0  m   N cfq18166S  /user.slice complete rqnoidle 0
  8,16   0        0     0.014525334     0  m   N cfq18166S  /user.slice arm_idle: 8 group_idle: 0
  8,16   0        0     0.014525676     0  m   N cfq schedule dispatch
  # 第三个IO开始：
  8,16   1       13     0.014531022 18166  Q   R 2 + 1 [dd]
  ...

注意，blkparse仅仅是将blktrace输出的信息转化成人可以阅读和理解的输出，但是，信息太多，太杂，人完全没法得到关键信息。这时候btt就横空出世了，这个工具可以将blktrace采集回来的数据，进行分析，得到对人更有用的信息。事实上，btt也是我们的终点。

注：
在以上数据中，有一些记录的event类型是”m”，那是IO Scheduler的调度信息，对研究IO Scheduler问题有意义：

cfq18166S – cfq是IO Scheduler的名称，18166是进程号，”S”表示Sync(同步IO)，如果异步IO则用“A”表示(Async)；
它们的第三列sequence number都是0；

它们表示IO Scheduler内部的关键函数，上例中是cfq，代码参见block/cfq-iosched.c，以下是各关键字所对应的内部函数：

alloced <<< cfq_find_alloc_queue()
insert_request <<< cfq_insert_request()
add_to_rr <<< cfq_add_cfqq_rr()
cfq workload slice:300 <<< choose_wl_class_and_type()
set_active wl_class:0 wl_type:2 <<< __cfq_set_active_queue()
fifo= (null) <<< cfq_check_fifo()
dispatch_insert <<< cfq_dispatch_insert()
dispatched a request <<< cfq_dispatch_requests()
activate rq, drv=1 <<< cfq_activate_request()
complete rqnoidle 0 <<< cfq_completed_request()
set_slice=100 <<< cfq_set_prio_slice()
arm_idle: 8 group_idle: 0 <<< cfq_arm_slice_timer()
cfq schedule dispatch <<< cfq_schedule_dispatch()

利用btt分析blktrace数据

blkparse只是将blktrace数据转成可以人工阅读的格式，由于数据量通常很大，人工分析并不轻松。btt是对blktrace数据进行自动分析的工具。

btt不能分析实时数据，只能对blktrace保存的数据文件进行分析。使用方法：
把原本按CPU分别保存的文件合并成一个，合并后的文件名为sdb.blktrace.bin：
$ blkparse -i sdb -d sdb.blktrace.bin
执行btt对sdb.blktrace.bin进行分析：
$ btt -i sdb.blktrace.bin

下面是一个btt实例：

...
            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
 
Q2Q               0.000138923   0.000154010   0.014298247       94558
Q2G               0.000001154   0.000001661   0.000017313       94559
G2I               0.000000883   0.000001197   0.000012011       94559
I2D               0.000004722   0.000005761   0.000027286       94559
D2C               0.000118840   0.000129201   0.014246176       94558
Q2C               0.000125953   0.000137820   0.014270153       94558
 
==================== Device Overhead ====================
 
       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |   1.2050%   0.8688%   0.0000%   4.1801%  93.7461%
---------- | --------- --------- --------- --------- ---------
   Overall |   1.2050%   0.8688%   0.0000%   4.1801%  93.7461%

我们看到93.7461%的时间消耗在D2C，也就是硬件层，这是正常的，我们说过D2C是衡量硬件性能的指标，这里单个IO平均0.129201毫秒，已经是相当快了，单个IO最慢14.246176 毫秒，不算坏。Q2G和G2I都很小，完全正常。I2D稍微有点大，应该是cfq scheduler的调度策略造成的，你可以试试其它scheduler，比如deadline，比较两者的差异，然后选择最适合你应用特点的那个。

注:

1、一些状态的对应关系

Act Description                                 SetPosition
A  IO was remapped to a different device   blk-core.c/trace_block_remap
B  IO bounced                              bounce.c/trace_block_bio_bounce
C  IO completion                           blk-core.c/trace_block_rq_complete
D  IO issued to driver                     elevator.c/trace_block_rq_issue
F  IO front merged with request on queue   blk-core.c/trace_block_bio_frontmerge
G  Get request                             blk-core.c/trace_block_getrq
I  IO inserted onto request queue          elevator.c/trace_block_rq_insert
M  IO back merged with request on queue    blk-core.c/trace_block_bio_backmerge
P  Plug request                            blk-core.c/trace_block_plug
Q  IO handled by request queue code        blk-core.c/trace_block_bio_queue
S  Sleep request                           blk-core.c/trace_block_sleeprq
T  Unplug due to timeout                   blk-core.c/trace_block_unplug_timer
U  Unplug request                          blk-core.c/trace_block_unplug_io
X  Split                                   bio.c/trace_block_split

2、Trace actions：

C – complete A previously issued request has been completed. The output
will detail the sector and size of that request, as well as the success or
failure of it.

D – issued A request that previously resided on the block layer queue or in
the io scheduler has been sent to the driver.

I – inserted A request is being sent to the io scheduler for addition to the
internal queue and later service by the driver. The request is fully formed
at this time.

Q – queued This notes intent to queue io at the given location. No real re-
quests exists yet.

B – bounced The data pages attached to this bio are not reachable by the
hardware and must be bounced to a lower memory location. This causes
a big slowdown in io performance, since the data must be copied to/from
kernel buffers. Usually this can be fixed with using better hardware -
either a better io controller, or a platform with an IOMMU.

m – message Text message generated via kernel call to blk add trace msg.

M – back merge A previously inserted request exists that ends on the bound-
ary of where this io begins, so the io scheduler can merge them together.

F – front merge Same as the back merge, except this io ends where a previ-
ously inserted requests starts.

G – get request To send any type of request to a block device, a struct request
container must be allocated first.

S – sleep No available request structures were available, so the issuer has to
wait for one to be freed.

P – plug When io is queued to a previously empty block device queue, Linux
will plug the queue in anticipation of future ios being added before this
data is needed.

U – unplug Some request data already queued in the device, start sending
requests to the driver. This may happen automatically if a timeout period
has passed (see next entry) or if a number of requests have been added to
the queue.

T – unplug due to timer If nobody requests the io that was queued after
plugging the queue, Linux will automatically unplug it after a defined
period has passed.

X – split On raid or device mapper setups, an incoming io may straddle a
device or internal zone and needs to be chopped up into smaller pieces
for service. This may indicate a performance problem due to a bad setup
of that raid/dm device, but may also just be part of normal boundary
conditions. dm is notably bad at this and will clone lots of io.

A – remap For stacked devices, incoming io is remapped to device below it in
the io stack. The remap action details what exactly is being remapped to what.

参考:

http://bean-li.github.io/blktrace-to-report/

http://linuxperf.com/?p=161

http://bbs.chinaunix.net/thread-1976867-1-1.html

http://www.cnblogs.com/cobbliu/p/7002933.html

/usr/share/doc/blktrace-1.0.1/blktrace.pdf
/usr/share/doc/blktrace-1.0.1/btt.pdf

http://blog.yufeng.info/archives/751

http://duch.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf

http://fibrevillage.com/storage/539-blktrace-and-btt-example-to-debug-and-tune-disk-i-o-on-linux

posted on 2017-10-17 15:26 Still water run deep 阅读(1883) 评论(0) 编辑收藏举报

刷新页面返回顶部

Still waters run deep

blktrace分析IO

前言

blktrace的原理

blktrace、blkparse和btt

利用btt分析blktrace数据

导航

公告