Ceph Performance Testing
I created five virtual machines on a physical host and built a Ceph cluster on them; the topology is shown in the figure:
For the detailed installation steps, see:
http://docs.ceph.org.cn/start/
http://www.centoscn.com/CentosServer/test/2015/0521/5489.html
I. Disk Read/Write Performance
1. Write performance of a single OSD disk
[root@lrr-ceph1 osd]# echo 3 > /proc/sys/vm/drop_caches    # drop the page cache, dentries and inodes
[root@lrr-ceph1 osd]# dd if=/dev/zero of=/var/lib/ceph/osd/lrr01 bs=1G count=1 oflag=direct    # run the write test
Note: write performance of two OSDs at the same time
[root@lrr-ceph1 osd]# for i in `mount | grep osd | awk '{print $3}'`; do (dd if=/dev/zero of=$i/lrr01 bs=1G count=1 oflag=direct &) ; done
2. Read performance of a single OSD disk
[root@lrr-ceph1 osd]# dd if=/var/lib/ceph/osd/lrr01 of=/dev/null bs=2G count=1 iflag=direct
0+1 records in
0+1 records out
1073741824 bytes (1.1 GB) copied, 7.13509 s, 150 MB/s
Note: read performance of two OSDs at the same time
[root@lrr-ceph1 osd]# for i in `mount | grep osd | awk '{print $3}'`; do (dd if=$i/lrr01 of=/dev/null bs=1G count=1 iflag=direct &); done
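To make the parallel runs easier to read, the loops above can be wrapped in a small script that waits for every dd to finish and then collects the per-OSD throughput lines. This is only a sketch, assuming the OSD data directories are the mount points matched by "grep osd" and reusing the test file name lrr01 from above; the log path under /tmp is my own choice:
#!/bin/bash
# Parallel direct-I/O write test across all mounted OSD data directories (sketch).
echo 3 > /proc/sys/vm/drop_caches            # drop the page cache, dentries and inodes first
for dir in $(mount | grep osd | awk '{print $3}'); do
    # one dd per OSD, run in the background, stderr (which carries the throughput line) logged per disk
    dd if=/dev/zero of="$dir/lrr01" bs=1G count=1 oflag=direct 2> "/tmp/dd-write-$(basename "$dir").log" &
done
wait                                         # wait for all dd processes to finish
grep copied /tmp/dd-write-*.log              # print each OSD's "bytes ... copied ... MB/s" line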
II. Ceph Performance Test Methods
Ceph performance testing covers RADOS performance tests and RBD performance tests.
RADOS test tools: the rados bench tool shipped with Ceph, and the rados load-gen tool;
RBD test tools: rbd bench-write for block-device write tests, fio with the rbd ioengine, and fio with libaio.
1. RADOS performance tests
1.1 Testing with the rados bench tool shipped with Ceph
The tool's syntax is: rados bench -p <pool_name> <seconds> <write|seq|rand> -b <block size> -t <concurrent_ops> --no-cleanup
pool_name: the pool the test runs against;
seconds: how long the test runs, in seconds;
<write|seq|rand>: operation mode; write = write, seq = sequential read, rand = random read;
-b: block size, default 4M;
-t: number of concurrent reads/writes, default 16;
--no-cleanup: do not delete the test data when the test finishes. Before running a read test, run a write test once with this option to generate the test data; after all tests are done, run rados -p <pool_name> cleanup to remove all test data.
i. Write test:
[root@lrr-ceph2 ~]# rados bench -p rbd 10 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_lrr-ceph2_4445
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        16         0         0         0            -           0
    2      16        16         0         0         0            -           0
...
   17      13        19         6   1.41038         0            -     10.7886
   18      13        19         6   1.33207         0            -     10.7886
   19      13        19         6   1.26201         0            -     10.7886
Total time run: 19.698032
Total writes made: 19
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 3.85825
Stddev Bandwidth: 0.551976
Max bandwidth (MB/sec): 1.71429
Min bandwidth (MB/sec): 0
Average IOPS: 0
Stddev IOPS: 0
Max IOPS: 0
Min IOPS: 0
Average Latency(s): 15.5797
Stddev Latency(s): 5.09105
Max latency(s): 19.6971
Min latency(s): 5.51094
ii. Sequential read test:
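The sequential read test reads back the objects left behind by the write run above (which is why --no-cleanup was needed). A minimal sketch of the command against the same rbd pool; the output format is the same as for the write test:
rados bench -p rbd 10 seq -t 16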
iii. Random read test:
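Likewise for random reads; once all runs are finished, the benchmark objects can be removed with cleanup as noted above:
rados bench -p rbd 10 rand -t 16
rados -p rbd cleanup        # remove the benchmark_data_* objects when done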
1.2 The rados load-gen tool
The tool's syntax is:
# rados -p rbd load-gen
    --num-objects          number of objects initially created for the test, default 200
    --min-object-size      minimum size of the test objects, default 1KB, in bytes
    --max-object-size      maximum size of the test objects, default 5GB, in bytes
    --min-op-len           minimum I/O size of the load, default 1KB, in bytes
    --max-op-len           maximum I/O size of the load, default 2MB, in bytes
    --max-ops              maximum number of outstanding I/Os, equivalent to iodepth
    --target-throughput    cap on the cumulative throughput of submitted I/O, default 5MB/s, in B/s
    --max-backlog          cap on the throughput of submitted I/O, default 10MB/s, in B/s
    --read-percent         percentage of reads in the mixed workload, default 80, range [0, 100]
    --run-length           run time, default 60s, in seconds
Running the following on ceph1:
rados -p pool100 load-gen --read-percent 0 --min-object-size 1073741824 --max-object-size 1073741824 --max-ops 1 --min-op-len 4194304 --max-op-len 4194304 --target-throughput 1073741824 --max-backlog 1073741824
produces:
WRITE : oid=obj-y0UPAZyRQNhnabq off=929764660 len=4194304 op 19 completed, throughput=16MB/sec
WRITE : oid=obj-nPcOZAc4ebBcnyN off=143211384 len=4194304 op 20 completed, throughput=20MB/sec
WRITE : oid=obj-sWGUAzzASPjCcwF off=343875215 len=4194304 op 21 completed, throughput=24MB/sec
WRITE : oid=obj-79r25fxxSMgVm11 off=383617425 len=4194304 op 22 completed, throughput=28MB/sec
This command sequentially writes a total of 1GB of data, in 4M blocks with iodepth = 1, onto 1GB objects. The average result is roughly 24MB/s, which is broadly in line with the rados bench result.
On the client, with the same settings, sequential write bandwidth is around 20MB/s and sequential read bandwidth is around 100MB/s.
Compared with rados bench, the distinguishing feature of rados load-gen is that it can generate mixed workloads, whereas rados bench can only generate one kind of load at a time. However, load-gen only reports throughput, so it is only suitable for large-block tests (around 4M), and its output does not include latency.
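As an illustration of such a mixed workload, a run with the default 80% read ratio and 4K operations against the same pool100 pool might look like the following; this is only a sketch built from the options listed above, not a run I performed:
rados -p pool100 load-gen --read-percent 80 --min-op-len 4096 --max-op-len 4096 --min-object-size 1073741824 --max-object-size 1073741824 --max-ops 16 --run-length 60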
2. RBD performance tests
2.1 Block-device write tests with rbd bench-write
2.1.1 Client preparation
Run the following commands to prepare the Ceph client:
root@client:/var# rbd create bd2 --size 1024
root@client:/var# rbd info --image bd2
rbd image 'bd2':
        size 1024 MB in 256 objects
        order 22 (4096 kB objects)
        block_name_prefix: rb.0.3841.74b0dc51
        format: 1
root@client:/var# rbd map bd2
root@client:/var# rbd showmapped
id pool  image snap device
1  pool1 bd1   -    /dev/rbd1
2  rbd   bd2   -    /dev/rbd2
root@client:/var# mkfs.xfs /dev/rbd2
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd2              isize=256    agcount=9, agsize=31744 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@client:/var# mkdir -p /mnt/ceph-bd2
root@client:/var# mount /dev/rbd2 /mnt/ceph-bd2/
root@client:/var# df -h /mnt/ceph-bd2/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd2      1014M   33M  982M   4% /mnt/ceph-bd2
2.1.2 Tests
The syntax of rbd bench-write is: rbd bench-write <RBD image name>, with the following options (a fully spelled-out example is sketched right after the list):
- --io-size: in bytes, default 4096 bytes = 4K
- --io-threads: number of threads, default 16
- --io-total: total bytes to write, default 1024M
- --io-pattern <seq|rand>: write pattern, default seq (sequential)
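For instance, a random 4K write run with every option given explicitly could look like this; it is only a sketch against the bd2 image created in 2.1.1 above, not a run I performed, and --io-pattern may not be available on very old rbd versions:
rbd bench-write bd2 --io-size 4096 --io-threads 16 --io-total 1073741824 --io-pattern rand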
Run the test on a cluster OSD node and on the client separately:
(1) On the OSD node
root@ceph1:~# rbd bench-write bd2 --io-total 171997300
bench-write  io_size 4096 io_threads 16 bytes 171997300 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       280    273.19   2237969.65
    2       574    286.84   2349818.65
  ...
   71     20456    288.00   2358395.28
   72     20763    288.29   2360852.64
elapsed:    72  ops:    21011  ops/sec:   288.75  bytes/sec: 2363740.27
Here the block size is 4K, IOPS is 289, and bandwidth is 2.36 MB/s (why is the bandwidth twice block_size * IOPS?).
(2) On the client
root@client:/home/s1# rbd bench-write pool.host/image.ph2 --io-total 1719973000 --io-size 4096000
bench-write  io_size 4096000 io_threads 16 bytes 1719973000 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1         5      3.41   27937685.86
    2        19      9.04   68193147.96
    3        28      8.34   62237889.75
    5        36      6.29   46538807.31
  ...
   39       232      5.86   40792216.64
   40       235      5.85   40666942.19
elapsed:    41  ops:      253  ops/sec:     6.06  bytes/sec: 41238190.87
Here the block size is 4M, IOPS is 6, and bandwidth is 41.24 MB/s.
root@client:/home/s1# rbd bench-write pool.host/image.ph2 --io-total 1719973000
bench-write  io_size 4096 io_threads 16 bytes 1719973000 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       331    329.52   2585220.17
    2       660    329.57   2521925.67
    3      1004    333.17   2426190.82
    4      1331    332.26   2392607.58
    5      1646    328.68   2322829.13
    6      1986    330.88   2316098.66
Here the block size is 4K, IOPS is around 330, and bandwidth is around 2.4 MB/s.
Note: judging from the thread "rbd bench-write vs dd performance confusion", rbd bench-write appears to have a bug. The Ceph version I am using is 0.80.11, so the fix may not have been merged in yet.
2.2 Using fio with the rbd ioengine
Install the fio tool with apt-get install fio, then create a fio job file:
root@client:/home/s1# cat write.fio
[write-4M]
description="write test with block size of 4M"
ioengine=rbd
clientname=admin
pool=rbd
rbdname=bd2
iodepth=32
runtime=120
rw=write        # write = sequential write, randwrite = random write, read = sequential read, randread = random read
bs=4M
Running the fio command fails with errors:
root@client:/home/s1# fio write.fio
fio: engine rbd not loadable
fio: failed to load engine rbd
Bad option <clientname=admin>
Bad option <pool=rbd>
Bad option <rbdname=bd2>
fio: job write-4M dropped
fio: file:ioengines.c:99, func=dlopen, error=rbd: cannot open shared object file: No such file or directory
The reason is that the fio librbd I/O engine is not installed, so the current fio build does not support the rbd ioengine:
root@client:/home/s1# fio --enghelp
Available IO engines:
        cpuio mmap sync psync vsync pvsync null net netsplice libaio rdma posixaio falloc e4defrag splice sg binject
After installing librbd with apt-get install librbd-dev, fio still reports the same error. Following suggestions found online, I downloaded the fio source code and rebuilt fio:
$ git clone git://git.kernel.dk/fio.git
$ cd fio
$ ./configure
[...]
Rados Block Device engine     yes
[...]
$ make
Now rbd shows up in fio's ioengine list. When fio uses the rbd I/O engine, it reads the settings in ceph.conf to connect to the Ceph cluster.
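A quick way to check that the rebuilt binary really contains the engine is to filter the engine list again (a sketch, run from the build directory):
./fio --enghelp | grep rbd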
The fio command and its result are shown below:
root@client:/home/s1/fio# ./fio ../write.fio
write-4M: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=32
fio-2.11-12-g82e6
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/128.0MB/0KB /s] [0/32/0 iops] [eta 00m:00s]
write-4M: (groupid=0, jobs=1): err= 0: pid=19190: Sat Jun  4 22:30:00 2016
  Description  : ["write test with block size of 4M"]
  write: io=1024.0MB, bw=17397KB/s, iops=4, runt= 60275msec
    slat (usec): min=129, max=54100, avg=1489.10, stdev=4907.83
    clat (msec): min=969, max=15690, avg=7399.86, stdev=1328.55
     lat (msec): min=969, max=15696, avg=7401.35, stdev=1328.67
    clat percentiles (msec):
     |  1.00th=[  971],  5.00th=[ 6325], 10.00th=[ 6325], 20.00th=[ 6521],
     | 30.00th=[ 6718], 40.00th=[ 7439], 50.00th=[ 7439], 60.00th=[ 7635],
     | 70.00th=[ 7832], 80.00th=[ 8291], 90.00th=[ 8356], 95.00th=[ 8356],
     | 99.00th=[14615], 99.50th=[15664], 99.90th=[15664], 99.95th=[15664],
     | 99.99th=[15664]
    bw (KB  /s): min=245760, max=262669, per=100.00%, avg=259334.50, stdev=6250.72
    lat (msec) : 1000=1.17%, >=2000=98.83%
  cpu          : usr=0.24%, sys=0.03%, ctx=50, majf=0, minf=8
  IO depths    : 1=2.3%, 2=5.5%, 4=12.5%, 8=25.0%, 16=50.4%, 32=4.3%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=97.0%, 8=0.0%, 16=0.0%, 32=3.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=256/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=17396KB/s, minb=17396KB/s, maxb=17396KB/s, mint=60275msec, maxt=60275msec

Disk stats (read/write):
  sda: ios=0/162, merge=0/123, ticks=0/19472, in_queue=19472, util=6.18%
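To test small random writes against the same image, only rw and bs in the job file need to change. A sketch of such a variant job follows; the job name randwrite-4K is just illustrative and I did not run it:
[randwrite-4K]
description="random write test with block size of 4K"
ioengine=rbd
clientname=admin
pool=rbd
rbdname=bd2
iodepth=32
runtime=120
rw=randwrite    # random write (see the rw comment in write.fio above)
bs=4K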