Ceph troubleshooting - handling a down OSD
1. When an OSD goes down, first confirm which host and which disk it maps to, so we can tell whether the disk itself has failed or something else is going on.
[root@test-3-134 devops]# ceph -s
  cluster:
    id:     380a1e72-da89-4041-8478-76383f5f6378
    health: HEALTH_WARN
            11 daemons have recently crashed

  services:
    mon: 3 daemons, quorum test-3-134,test-3-137,test-3-139 (age 7w)
    mgr: test-3-139(active, since 4w), standbys: test-3-31, test-3-137
    osd: 84 osds: 82 up (since 4d), 82 in (since 4d)   # two OSDs are down here (84 total, 82 up)
    rgw: 1 daemon active (test-3-31)

  task status:

  data:
    pools:   8 pools, 640 pgs
    objects: 5.55M objects, 20 TiB
    usage:   39 TiB used, 395 TiB / 434 TiB avail
    pgs:     639 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 35 MiB/s rd, 86 MiB/s wr, 285 op/s rd, 465 op/s wr
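As a cross-check, ceph health detail lists the down OSDs directly, and the "11 daemons have recently crashed" warning can be reviewed and acknowledged through the crash module on Nautilus and later. A minimal sketch (the crash id is a placeholder):
ceph health detail              # shows OSD_DOWN details and which OSDs are affected
ceph crash ls                   # list the recent daemon crashes behind the HEALTH_WARN
ceph crash info <crash-id>      # inspect a single entry (placeholder id)
ceph crash archive-all          # acknowledge them all once reviewed, clearing the warning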
Check which two OSDs they are:
[root@test-3-134 devops]# ceph osd tree
-15 32.74818 host test-3-32
87 hdd 3.63869 osd.87 up 1.00000 1.00000
88 hdd 3.63869 osd.88 up 1.00000 1.00000
89 hdd 3.63869 osd.89 up 1.00000 1.00000
90 hdd 3.63869 osd.90 up 1.00000 1.00000
91 hdd 3.63869 osd.91 up 1.00000 1.00000
92 hdd 3.63869 osd.92 up 1.00000 1.00000
93 hdd 3.63869 osd.93 up 1.00000 1.00000
94 hdd 3.63869 osd.94 up 1.00000 1.00000
95 hdd 3.63869 osd.95 down 0 1.00000
-13 32.74818 host test-3-33
78 hdd 3.63869 osd.78 up 1.00000 1.00000
79 hdd 3.63869 osd.79 up 1.00000 1.00000
80 hdd 3.63869 osd.80 up 1.00000 1.00000
81 hdd 3.63869 osd.81 up 1.00000 1.00000
82 hdd 3.63869 osd.82 down 0 1.00000
83 hdd 3.63869 osd.83 up 1.00000 1.00000
84 hdd 3.63869 osd.84 up 1.00000 1.00000
85 hdd 3.63869 osd.85 up 1.00000 1.00000
86 hdd 3.63869 osd.86 up 1.00000 1.00000
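On a larger cluster the full tree is noisy, so filtering it is quicker. A small sketch (the state filter should be available on Luminous and later; plain grep works everywhere):
ceph osd tree down                      # show only down OSDs and their ancestors
ceph osd tree | grep -E 'host|down'     # fallback: keep the host lines for context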
Log in to the corresponding host and confirm which physical disk the down OSD maps to (a quick SMART check is sketched after the listing):
[root@test-3-32 ~]# ceph-volume lvm list |grep -E "osd\.|dev"
====== osd.95 ======
[block] /dev/ceph-f4e2366c-d871-4910-a044-ed52de2a397e/osd-block-3e44b34d-3881-4e34-ad9e-5e1906617c07
block device /dev/ceph-f4e2366c-d871-4910-a044-ed52de2a397e/osd-block-3e44b34d-3881-4e34-ad9e-5e1906617c07
crush device class None
devices /dev/sdk
[root@test-3-32 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdf 8:80 0 3.7T 0 disk
sdd 8:48 0 3.7T 0 disk
└─ceph--589b2759--4c59--49c0--8081--612bfa3ed91a-osd--block--39f28dca--a05d--49fa--a92c--ab650d24a4d4 253:4 0 3.7T 0 lvm
sdk 8:160 0 3.7T 0 disk
└─ceph--f4e2366c--d871--4910--a044--ed52de2a397e-osd--block--3e44b34d--3881--4e34--ad9e--5e1906617c07 253:1 0 3.7T 0 lvm
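The device behind osd.95 is /dev/sdk and it is still visible, so it is worth checking its SMART health before deciding between a simple restart and a disk replacement. A minimal sketch, assuming smartmontools is installed and the disk is directly attached (drives behind a RAID controller may need an extra -d option):
smartctl -H /dev/sdk                                              # overall health self-assessment
smartctl -A /dev/sdk | grep -iE 'reallocated|pending|uncorrect'   # failure-related attributes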
2. The disk is still present, so first try restarting the ceph-osd service. In this case the daemon came back up.
[root@test-3-32 ~]# systemctl | grep ceph
● ceph-osd@95.service loaded failed failed Ceph object storage daemon osd.95
[root@test-3-32 ~]# systemctl restart ceph-osd@95.service
[root@test-3-32 ~]# systemctl status ceph-osd@95.service
● ceph-osd@95.service - Ceph object storage daemon osd.95
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
Active: active (running) since Mon 2022-08-08 15:04:30 CST; 10s ago
Process: 1062974 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
Main PID: 1062980 (ceph-osd)
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@95.service
└─1062980 /usr/bin/ceph-osd -f --cluster ceph --id 95 --setuser ceph --setgroup ceph
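If the daemon refuses to stay up, its own logs usually explain the crash before you resort to rebuilding the OSD. For example, using the default unit name and log path for osd.95:
journalctl -u ceph-osd@95 --no-pager -n 100     # recent start/crash messages from systemd
tail -n 200 /var/log/ceph/ceph-osd.95.log       # default OSD log location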
3. If the restart does not help, or the drive letter has drifted, remove the OSD and re-deploy it.
3.1 Check the system logs for disk errors (a kernel-log check is also sketched below):
egrep -i 'medium|i\/o error|sector|Prefailure' /var/log/messages
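The kernel ring buffer is another place where a failing disk shows up; a quick check (the device name is the one identified earlier):
dmesg -T | grep -iE 'sdk|i/o error|medium error'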
3.2 Remove the OSD from the cluster directly:
osdid=82
systemctl stop ceph-osd@$osdid.service
ceph osd out osd.$osdid
ceph osd crush rm osd.$osdid
ceph auth del osd.$osdid
ceph osd down osd.$osdid
ceph osd rm osd.$osdid   # the plain numeric form (ceph osd rm $osdid) is accepted as well
Or use the following; I have not looked closely at how the two approaches differ, but in testing either one removes the OSD:
OSD=54
ceph osd ok-to-stop osd.$OSD
ceph osd safe-to-destroy osd.$OSD
ceph osd down osd.$OSD
ceph osd purge osd.$OSD --yes-i-really-mean-it
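Note that ok-to-stop and safe-to-destroy are only safety checks. To take an OSD out without reducing data durability, one common pattern is to mark it out, wait until safe-to-destroy succeeds, and only then stop and purge it. A minimal sketch, assuming safe-to-destroy returns a non-zero exit code while PGs still depend on the OSD:
OSD=82
ceph osd out osd.$OSD                          # start migrating PGs off this OSD
until ceph osd safe-to-destroy osd.$OSD; do    # loop until the data is safe elsewhere
    echo "waiting for PGs to move off osd.$OSD ..."
    sleep 60
done
systemctl stop ceph-osd@$OSD.service
ceph osd purge osd.$OSD --yes-i-really-mean-it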
3.3 Reformat the disk and add it back:
# sometimes the stale device-mapper entry has to be removed first; list entries with: dmsetup ls
# dmsetup remove ceph--c0df59cb--3c80--4caf--8af4--dd43e0be7786-osd--block--53ade74d--be95--4997--8f24--d9cd34e6ee41
mkfs.xfs -f /dev/sdm
ceph-deploy osd create --data /dev/sdg test-3-33
# or
# ceph-deploy --overwrite-conf osd create --data /dev/sdn dx-lt-yd-zhejiang-jinhua-5-10-104-1-130
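Instead of mkfs plus manual dmsetup cleanup, ceph-volume can wipe the old LVM metadata in one step; a sketch (the device name is just an example from this host):
ceph-volume lvm zap /dev/sdk --destroy   # removes the old PV/VG/LV and wipes the disk, then re-add with ceph-deploy osd create as above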
References
Fixing a Ceph OSD that is DOWN: https://blog.csdn.net/baidu_26495369/article/details/80325315