9 Advanced CRUSH Usage in a Ceph Cluster
9.1 Ceph cluster maps
A Ceph cluster has five cluster maps, maintained by the mon servers:
1. Monitor map # map of the monitors;
2. OSD map # map of the OSDs;
3. PG map # map of the placement groups;
4. CRUSH map (Controlled Replication Under Scalable Hashing) # a controlled, replicated, scalable consistent-hashing algorithm; when a new pool is created, a new set of PG-to-OSD mappings is built on top of the OSD map to store the pool's data, and the map's state is updated dynamically;
5. MDS map # map of the CephFS metadata servers;
9.2 CRUSH bucket algorithms
Uniform
List
Tree
Straw
Straw2 # the default
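These bucket algorithms decide how CRUSH picks an item out of a bucket. The core idea of Straw2, the default, can be sketched in Python. This is a hypothetical illustration only (the real implementation hashes with rjenkins1 and works in fixed-point arithmetic): each item draws a pseudo-random "straw" scaled by its own weight and the longest straw wins, so an item is chosen with probability proportional to its weight, and changing one item's weight never disturbs the straws of the others.

```python
import hashlib
import math

def straw2_choose(items, pg_id, r=0):
    """Pick one item with probability proportional to its weight (sketch).

    Every item draws a deterministic pseudo-random straw that depends only
    on its own name and weight; the longest straw wins.  Adding or removing
    an item therefore never changes the other items' straws, which is why
    straw2 minimizes data movement on topology changes.
    """
    best, best_straw = None, None
    for name, weight in items.items():
        # Deterministic pseudo-random u in (0, 1] per (item, pg, replica).
        h = hashlib.md5(f"{name}:{pg_id}:{r}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / 2**64
        straw = math.log(u) / weight   # ln(u) <= 0; a larger weight gives a longer straw
        if best_straw is None or straw > best_straw:
            best, best_straw = name, straw
    return best

# A weight-2.0 OSD (e.g. a 2 TiB disk) should win roughly twice as often
# as the weight-1.0 OSDs over many placements.
osds = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}
counts = {name: 0 for name in osds}
for pg in range(10000):
    counts[straw2_choose(osds, pg)] += 1
print(counts)
```

The exponential-race property behind this (straws are scaled exponential variables, and the minimum of exponentials lands on item i with probability w_i/Σw) is what makes selection exactly weight-proportional.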
9.3 Adjusting the PG-to-OSD mapping
By default, the CRUSH algorithm assigns OSDs to the PGs of a newly created pool on its own, but you can bias where CRUSH places data by setting weights manually: for example, a 1 TB disk gets weight 1 and a 2 TB disk gets weight 2. Using devices of the same size is recommended.
9.3.1 View the current state
weight: the relative capacity of a device; 1 TB corresponds to 1.00, so a 500 GB OSD should have a weight of 0.5. The weight determines how many PGs each disk receives: CRUSH places more PGs on OSDs with more space and fewer PGs on OSDs with less space.
reweight: this parameter rebalances the PGs that CRUSH's pseudo-random placement has already assigned. The default placement is only balanced in expectation, so even identical disks can end up with an uneven PG distribution; adjusting reweight makes the cluster immediately rebalance the PGs on that disk to even out the data. In other words, reweight acts after the PGs have been assigned and redistributes them across the cluster. Its value ranges from 0 to 1.
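The 1.00-per-TB weight convention above can be checked in a couple of lines (a sketch; the exact figure Ceph reports also depends on how it rounds the usable capacity at deployment time):

```python
def crush_weight(capacity_bytes):
    # Convention sketched above: CRUSH weight 1.0 per TiB of raw capacity.
    return round(capacity_bytes / 2**40, 5)

print(crush_weight(2**40))        # 1 TiB disk -> 1.0
print(crush_weight(20 * 2**30))   # a 20 GiB test OSD -> 0.01953
```

The `ceph osd df` output below shows 0.01949 for the same 20 GiB OSDs, slightly below this raw-capacity figure, presumably because the weight is derived from the usable size after deployment overhead.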
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.01949 1.00000 20 GiB 318 MiB 27 MiB 4 KiB 291 MiB 20 GiB 1.56 0.72 114 up
1 hdd 0.01949 1.00000 20 GiB 388 MiB 28 MiB 6 KiB 360 MiB 20 GiB 1.89 0.87 115 up
2 hdd 0.01949 1.00000 20 GiB 451 MiB 33 MiB 25 KiB 418 MiB 20 GiB 2.20 1.02 124 up
3 hdd 0.01949 1.00000 20 GiB 434 MiB 31 MiB 25 KiB 403 MiB 20 GiB 2.12 0.98 128 up
4 hdd 0.01949 1.00000 20 GiB 377 MiB 34 MiB 8 KiB 342 MiB 20 GiB 1.84 0.85 116 up
5 hdd 0.01949 1.00000 20 GiB 545 MiB 23 MiB 2 KiB 522 MiB 19 GiB 2.66 1.23 109 up
6 hdd 0.01949 1.00000 20 GiB 433 MiB 18 MiB 9 KiB 415 MiB 20 GiB 2.11 0.98 124 up
7 hdd 0.01949 1.00000 20 GiB 548 MiB 45 MiB 24 KiB 503 MiB 19 GiB 2.68 1.24 120 up
8 hdd 0.01949 1.00000 20 GiB 495 MiB 26 MiB 5 KiB 469 MiB 20 GiB 2.42 1.12 109 up
TOTAL 180 GiB 3.9 GiB 264 MiB 113 KiB 3.6 GiB 176 GiB 2.16
MIN/MAX VAR: 0.72/1.24 STDDEV: 0.35
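The VAR column is each OSD's %USE divided by the cluster-wide mean, and MIN/MAX VAR summarizes its extremes; recomputing them from the listing above shows where the 0.72/1.24 figures come from:

```python
# %USE values of osd.0 .. osd.8, copied from the `ceph osd df` listing above.
use = [1.56, 1.89, 2.20, 2.12, 1.84, 2.66, 2.11, 2.68, 2.42]
mean = sum(use) / len(use)
print(round(mean, 2))             # -> 2.16, the TOTAL %USE row
print(round(min(use) / mean, 2))  # -> 0.72, MIN VAR (osd.0)
print(round(max(use) / mean, 2))  # -> 1.24, MAX VAR (osd.7)
```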
9.3.2 Change the weight value
ceph@ceph-deploy:~/ceph-cluster$ ceph osd crush reweight osd.7 0.07
reweighted item id 7 name 'osd.7' to 0.07 in crush map
9.3.3 Verify the weight change
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.01949 1.00000 20 GiB 338 MiB 32 MiB 4 KiB 306 MiB 20 GiB 1.65 0.74 109 up
1 hdd 0.01949 1.00000 20 GiB 397 MiB 25 MiB 6 KiB 372 MiB 20 GiB 1.94 0.87 117 up
2 hdd 0.01949 1.00000 20 GiB 446 MiB 34 MiB 25 KiB 412 MiB 20 GiB 2.18 0.98 127 up
3 hdd 0.01949 1.00000 20 GiB 447 MiB 32 MiB 25 KiB 414 MiB 20 GiB 2.18 0.98 129 up
4 hdd 0.01949 1.00000 20 GiB 378 MiB 29 MiB 8 KiB 350 MiB 20 GiB 1.85 0.83 112 up
5 hdd 0.01949 1.00000 20 GiB 569 MiB 31 MiB 2 KiB 538 MiB 19 GiB 2.78 1.25 112 up
6 hdd 0.01949 1.00000 20 GiB 439 MiB 16 MiB 9 KiB 423 MiB 20 GiB 2.14 0.96 65 up
7 hdd 0.06999 1.00000 20 GiB 598 MiB 60 MiB 24 KiB 538 MiB 19 GiB 2.92 1.31 228 up
8 hdd 0.01949 1.00000 20 GiB 493 MiB 16 MiB 5 KiB 477 MiB 20 GiB 2.41 1.08 60 up
TOTAL 180 GiB 4.0 GiB 274 MiB 113 KiB 3.7 GiB 176 GiB 2.23
MIN/MAX VAR: 0.74/1.31 STDDEV: 0.39
9.3.4 Change the reweight value
ceph@ceph-deploy:~/ceph-cluster$ ceph osd reweight 6 0.6
reweighted osd.6 to 0.6 (9999)
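A detail worth noting: after this command, `ceph osd df` reports REWEIGHT as 0.59999 rather than 0.6. That is a fixed-point artifact; the override weight is stored as a fraction of 0x10000 (an encoding assumption worth verifying against your release), so 0.6 cannot be represented exactly:

```python
# reweight stored as a 16-bit fixed-point fraction, where 0x10000 = 1.0:
# 0.6 becomes 39321/65536, which prints back as 0.59999.
raw = int(0.6 * 0x10000)
print(raw)                       # -> 39321
print(round(raw / 0x10000, 5))   # -> 0.59999
```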
9.3.5 Verify the reweight change
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.01949 1.00000 20 GiB 339 MiB 32 MiB 4 KiB 307 MiB 20 GiB 1.65 0.74 109 up
1 hdd 0.01949 1.00000 20 GiB 397 MiB 25 MiB 6 KiB 372 MiB 20 GiB 1.94 0.87 117 up
2 hdd 0.01949 1.00000 20 GiB 451 MiB 34 MiB 25 KiB 417 MiB 20 GiB 2.20 0.98 127 up
3 hdd 0.01949 1.00000 20 GiB 451 MiB 32 MiB 25 KiB 419 MiB 20 GiB 2.20 0.98 129 up
4 hdd 0.01949 1.00000 20 GiB 383 MiB 29 MiB 8 KiB 354 MiB 20 GiB 1.87 0.83 112 up
5 hdd 0.01949 1.00000 20 GiB 569 MiB 31 MiB 2 KiB 539 MiB 19 GiB 2.78 1.24 112 up
6 hdd 0.01949 0.59999 20 GiB 443 MiB 16 MiB 9 KiB 427 MiB 20 GiB 2.16 0.97 38 up
7 hdd 0.06999 1.00000 20 GiB 604 MiB 60 MiB 24 KiB 544 MiB 19 GiB 2.95 1.32 247 up
8 hdd 0.01949 1.00000 20 GiB 493 MiB 16 MiB 5 KiB 477 MiB 20 GiB 2.41 1.07 64 up
TOTAL 180 GiB 4.0 GiB 274 MiB 113 KiB 3.8 GiB 176 GiB 2.24
MIN/MAX VAR: 0.74/1.32 STDDEV: 0.40
9.4 Managing the CRUSH map
The exported CRUSH map is in binary format and cannot be opened directly in a text editor; it must first be decompiled into text with the crushtool utility before it can be viewed and edited with vim or another editor.
9.4.1 Export the CRUSH map
root@ceph-deploy:~# mkdir -pv /data/ceph
mkdir: created directory '/data/ceph'
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
73
9.4.2 Decompile the map to text
root@ceph-deploy:~# apt -y install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt
/data/ceph/crushmap.txt: ASCII text
9.4.3 Sample CRUSH map
root@ceph-deploy:~# cat /data/ceph/crushmap.txt
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices - the current device list
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd

# types - the bucket types currently supported
type 0 osd          # an OSD daemon, mapped to a single disk device
type 1 host         # a host
type 2 chassis      # a blade-server chassis
type 3 rack         # a rack holding several hosts
type 4 row          # a row of racks
type 5 pdu          # the power distribution unit feeding a rack
type 6 pod          # a group of rooms inside a machine room
type 7 room         # a room holding several racks; a data center is made up of many such rooms
type 8 datacenter   # a data center (IDC)
type 9 zone         # an availability zone
type 10 region      # a region, e.g. an AWS region
type 11 root        # the root, the very top of the bucket hierarchy

# buckets
host ceph-node-01 {
    id -3               # do not change unnecessarily - ID generated by Ceph
    id -4 class hdd     # do not change unnecessarily
    # weight 0.058
    alg straw2          # the CRUSH bucket algorithm used to manage the OSDs
    hash 0              # rjenkins1 - which hash algorithm to use; 0 selects rjenkins1
    item osd.0 weight 0.019   # weight of osd.0; CRUSH derives it from disk capacity, so disks of different sizes get different weights
    item osd.1 weight 0.019
    item osd.2 weight 0.019
}
host ceph-node-02 {
    id -5               # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 0.058
    alg straw2
    hash 0 # rjenkins1
    item osd.3 weight 0.019
    item osd.4 weight 0.019
    item osd.5 weight 0.019
}
host ceph-node-03 {
    id -7               # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 0.109
    alg straw2
    hash 0 # rjenkins1
    item osd.6 weight 0.019
    item osd.7 weight 0.070
    item osd.8 weight 0.019
}
root default {
    id -1               # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 0.226
    alg straw2
    hash 0 # rjenkins1
    item ceph-node-01 weight 0.058
    item ceph-node-02 weight 0.058
    item ceph-node-03 weight 0.109
}

# rules
rule replicated_rule {      # default rule for replicated pools
    id 0
    type replicated
    min_size 1
    max_size 10             # at most 10 replicas by default
    step take default       # allocate OSDs starting from the "default" root
    step chooseleaf firstn 0 type host   # choose leaves with host as the failure domain
    step emit               # emit the result, i.e. return it to the client
}
rule erasure-code {         # default rule for erasure-coded pools
    id 1
    type erasure
    min_size 3
    max_size 4
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
# end crush map
9.4.4 Edit the CRUSH map
Change max_size 10 to max_size 8.
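The change can be made in vim, or scripted; below is a minimal, hypothetical Python sketch of the same line-based edit (set_max_size is not a Ceph tool, just naive text processing over the decompiled map):

```python
def set_max_size(text, rule_name, new_max):
    """Change the max_size line inside the named rule block (naive sketch)."""
    out, in_rule = [], False
    for line in text.splitlines():
        if line.strip().startswith(f"rule {rule_name}"):
            in_rule = True
        if in_rule and line.strip().startswith("max_size"):
            # Preserve the original indentation, replace the value.
            line = line.split("max_size")[0] + f"max_size {new_max}"
            in_rule = False          # only touch the first max_size in the rule
        out.append(line)
    return "\n".join(out)

# A fragment of the decompiled map, as shown in 9.4.3.
sample = """rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}"""
print(set_max_size(sample, "replicated_rule", 8))
```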
9.4.5 Compile the text back to binary CRUSH format
root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap
9.4.6 Import the new CRUSH map
root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap
9.4.7 Verify that the new CRUSH map took effect
root@ceph-deploy:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 8,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
9.5 Class-based data placement with CRUSH
When the CRUSH algorithm assigns PGs, it spreads them across OSDs on different hosts, giving host-level high availability; this is the default mechanism. The default, however, cannot guarantee that a PG's replicas land on hosts in different racks or machine rooms, nor can it keep project A's data on SSDs while project B's data stays on spinning disks. To get rack-level (or higher, e.g. IDC-level) failure domains, or placement by device class, you need to export the CRUSH map, edit it by hand, then import it back to overwrite the original.
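The effect of the failure-domain type in a rule can be illustrated with a toy topology (hypothetical names; a much-simplified deterministic stand-in for CRUSH's actual pseudo-random descent): with `type host` each replica lands on a distinct host, while with `type rack` each replica lands in a distinct rack.

```python
import itertools

# Hypothetical two-rack cluster; names and layout are made up for illustration.
topology = {
    "rack1": {"host1": ["osd.0", "osd.1"], "host2": ["osd.2", "osd.3"]},
    "rack2": {"host3": ["osd.4", "osd.5"], "host4": ["osd.6", "osd.7"]},
}

def chooseleaf(topology, n, failure_domain):
    """Pick one OSD from each of n distinct failure-domain buckets (sketch)."""
    if failure_domain == "host":
        # Each host is its own bucket.
        buckets = [osds for hosts in topology.values() for osds in hosts.values()]
    elif failure_domain == "rack":
        # Flatten every rack into a single bucket.
        buckets = [list(itertools.chain(*hosts.values())) for hosts in topology.values()]
    return [bucket[0] for bucket in buckets[:n]]

print(chooseleaf(topology, 3, "host"))   # -> ['osd.0', 'osd.2', 'osd.4'] - three different hosts
print(chooseleaf(topology, 2, "rack"))   # -> ['osd.0', 'osd.4'] - two different racks
```

With a host failure domain, losing host1 costs at most one replica of any PG; with a rack failure domain, losing all of rack1 still costs at most one replica.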
9.5.1 Export the CRUSH map
root@ceph-deploy:~# mkdir -pv /data/ceph
mkdir: created directory '/data/ceph'
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
73
9.5.2 Decompile the map to text
root@ceph-deploy:~# apt -y install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt
/data/ceph/crushmap.txt: ASCII text
9.5.3 Add custom configuration
Notes:
Host names must not be repeated.
Buckets must be defined before rules.
# ssd node
host ceph-sshnode-01 {
id -103 # do not change unnecessarily
id -104 class hdd # do not change unnecessarily
# weight 0.098
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.019
}
host ceph-sshnode-02 {
id -105 # do not change unnecessarily
id -106 class hdd # do not change unnecessarily
# weight 0.098
alg straw2
hash 0 # rjenkins1
item osd.5 weight 0.019
}
host ceph-sshnode-03 {
id -107 # do not change unnecessarily
id -108 class hdd # do not change unnecessarily
# weight 0.098
alg straw2
hash 0 # rjenkins1
item osd.8 weight 0.019
}
# bucket
root ssd {
id -127 # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 1.952
alg straw
hash 0 # rjenkins1
item ceph-sshnode-01 weight 0.088
item ceph-sshnode-02 weight 0.088
item ceph-sshnode-03 weight 0.088
}
#ssd rules
rule ssd_rule {
id 20
type replicated
min_size 1
max_size 5
step take ssd
step chooseleaf firstn 0 type host
step emit
}
9.5.4 Compile back to binary CRUSH format
root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap-01
9.5.5 Import the new CRUSH map
root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap-01
76
9.5.6 Verify that the new CRUSH map took effect
root@ceph-deploy:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 8,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 20,
        "rule_name": "ssd_rule",
        "ruleset": 20,
        "type": 1,
        "min_size": 1,
        "max_size": 5,
        "steps": [
            {
                "op": "take",
                "item": -127,
                "item_name": "ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
9.5.7 Test by creating a pool
root@ceph-deploy:~# ceph osd pool create ssdpool 32 32 ssd_rule
pool 'ssdpool' created
9.5.8 Verify the PG status
root@ceph-deploy:~# ceph pg ls-by-pool ssdpool | awk '{print $1,$2,$15}'
PG OBJECTS ACTING
28.0 0 [8,0,5]p8
28.1 0 [5,8,0]p5
28.2 0 [8,0,5]p8
28.3 0 [8,5,0]p8
28.4 0 [0,5,8]p0
28.5 0 [5,8,0]p5
28.6 0 [5,8,0]p5
28.7 0 [8,0,5]p8
28.8 0 [0,5,8]p0
28.9 0 [8,5,0]p8
28.a 0 [5,0,8]p5
28.b 0 [0,5,8]p0
28.c 0 [8,5,0]p8
28.d 0 [8,5,0]p8
28.e 0 [0,5,8]p0
28.f 0 [5,0,8]p5
28.10 0 [5,0,8]p5
28.11 0 [0,5,8]p0
28.12 0 [5,0,8]p5
28.13 0 [0,8,5]p0
28.14 0 [0,5,8]p0
28.15 0 [0,8,5]p0
28.16 0 [8,0,5]p8
28.17 0 [5,0,8]p5
28.18 0 [5,8,0]p5
28.19 0 [5,0,8]p5
28.1a 0 [5,8,0]p5
28.1b 0 [5,0,8]p5
28.1c 0 [8,5,0]p8
28.1d 0 [5,0,8]p5
28.1e 0 [0,8,5]p0
28.1f 0 [5,0,8]p5
As shown above, the PGs of the newly created ssdpool are all placed on osd.0, osd.5, and osd.8, which matches the rule we added.
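This check can also be scripted instead of eyeballed; a sketch that parses a few ACTING entries copied from the listing above and verifies they only reference the OSDs placed under the ssd root:

```python
# Sample ACTING entries from the `ceph pg ls-by-pool ssdpool` output above.
acting = ["[8,0,5]p8", "[5,8,0]p5", "[0,5,8]p0", "[8,5,0]p8"]
allowed = {0, 5, 8}               # the OSDs placed under the ssd root
for entry in acting:
    # "[8,0,5]p8" -> {8, 0, 5}
    osds = {int(x) for x in entry.split("]")[0].lstrip("[").split(",")}
    assert osds <= allowed, entry
print("ssdpool PGs stay on the ssd root OSDs")
```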
9.6 Node-to-OSD mapping
root@ceph-deploy:~# ceph osd tree
ID    CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-127         0.26399  root ssd
-103         0.08800      host ceph-sshnode-01
   0    hdd  0.01900          osd.0                 up   1.00000  1.00000
-105         0.08800      host ceph-sshnode-02
   5    hdd  0.01900          osd.5                 up   1.00000  1.00000
-107         0.08800      host ceph-sshnode-03
   8    hdd  0.01900          osd.8                 up   1.00000  1.00000
  -1         0.22499  root default
  -3         0.05800      host ceph-node-01
   0    hdd  0.01900          osd.0                 up   1.00000  1.00000
   1    hdd  0.01900          osd.1                 up   1.00000  1.00000
   2    hdd  0.01900          osd.2                 up   1.00000  1.00000
  -5         0.05800      host ceph-node-02
   3    hdd  0.01900          osd.3                 up   1.00000  1.00000
   4    hdd  0.01900          osd.4                 up   1.00000  1.00000
   5    hdd  0.01900          osd.5                 up   1.00000  1.00000
  -7         0.10899      host ceph-node-03
   6    hdd  0.01900          osd.6                 up   0.59999  1.00000
   7    hdd  0.06999          osd.7                 up   1.00000  1.00000
   8    hdd  0.01900          osd.8                 up   1.00000  1.00000