Ceph troubleshooting - handling a down OSD

1. After noticing an OSD has gone down, first confirm which disk on which host it is, so we can tell whether the disk itself has failed or something else is going on.
[root@test-3-134 devops]# ceph -s
  cluster:
    id:     380a1e72-da89-4041-8478-76383f5f6378
    health: HEALTH_WARN
            11 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum test-3-134,test-3-137,test-3-139 (age 7w)
    mgr: test-3-139(active, since 4w), standbys: test-3-31, test-3-137
    osd: 84 osds: 82 up (since 4d), 82 in (since 4d)   # here you can see that two OSDs are down
    rgw: 1 daemon active (test-3-31)
 
  task status:
 
  data:
    pools:   8 pools, 640 pgs
    objects: 5.55M objects, 20 TiB
    usage:   39 TiB used, 395 TiB / 434 TiB avail
    pgs:     639 active+clean
             1   active+clean+scrubbing+deep
 
  io:
    client:   35 MiB/s rd, 86 MiB/s wr, 285 op/s rd, 465 op/s wr
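
By the way, the "11 daemons have recently crashed" warning comes from the crash module (Nautilus and later). Once the underlying failures have been dealt with, the reports can be listed and archived to clear the warning; a minimal sketch, with <crash-id> standing in for an ID taken from the listing:

# list the recent crash reports (the warning counts the un-archived ones)
ceph crash ls
# inspect a single report; substitute an ID from the listing above
ceph crash info <crash-id>
# archive everything once handled, which clears the HEALTH_WARN
ceph crash archive-all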

Let's see which two they are.

[root@test-3-134 devops]# ceph osd tree  
-15        32.74818     host test-3-32                          
 87   hdd   3.63869         osd.87                                           up  1.00000 1.00000 
 88   hdd   3.63869         osd.88                                           up  1.00000 1.00000 
 89   hdd   3.63869         osd.89                                           up  1.00000 1.00000 
 90   hdd   3.63869         osd.90                                           up  1.00000 1.00000 
 91   hdd   3.63869         osd.91                                           up  1.00000 1.00000 
 92   hdd   3.63869         osd.92                                           up  1.00000 1.00000 
 93   hdd   3.63869         osd.93                                           up  1.00000 1.00000 
 94   hdd   3.63869         osd.94                                           up  1.00000 1.00000 
 95   hdd   3.63869         osd.95                                         down        0 1.00000 
-13        32.74818     host test-3-33                          
 78   hdd   3.63869         osd.78                                           up  1.00000 1.00000 
 79   hdd   3.63869         osd.79                                           up  1.00000 1.00000 
 80   hdd   3.63869         osd.80                                           up  1.00000 1.00000 
 81   hdd   3.63869         osd.81                                           up  1.00000 1.00000 
 82   hdd   3.63869         osd.82                                         down        0 1.00000 
 83   hdd   3.63869         osd.83                                           up  1.00000 1.00000 
 84   hdd   3.63869         osd.84                                           up  1.00000 1.00000 
 85   hdd   3.63869         osd.85                                           up  1.00000 1.00000 
 86   hdd   3.63869         osd.86                                           up  1.00000 1.00000 
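
Instead of scanning the whole tree, the down OSDs can usually be listed directly; a small sketch, assuming a reasonably recent release where the tree state filter is available:

# show only the subtrees containing down OSDs
ceph osd tree down
# or grep the OSD map dump for the down entries
ceph osd dump | grep down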

Log in to the corresponding host to confirm which disk it is.

[root@test-3-32 ~]# ceph-volume lvm list |grep -E "osd\.|dev"
====== osd.95 ======
  [block]       /dev/ceph-f4e2366c-d871-4910-a044-ed52de2a397e/osd-block-3e44b34d-3881-4e34-ad9e-5e1906617c07
      block device              /dev/ceph-f4e2366c-d871-4910-a044-ed52de2a397e/osd-block-3e44b34d-3881-4e34-ad9e-5e1906617c07
      crush device class        None
      devices                   /dev/sdk
[root@test-3-32 ~]# lsblk 
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdf                                                                                                     8:80   0   3.7T  0 disk  
sdd                                                                                                     8:48   0   3.7T  0 disk  
└─ceph--589b2759--4c59--49c0--8081--612bfa3ed91a-osd--block--39f28dca--a05d--49fa--a92c--ab650d24a4d4 253:4    0   3.7T  0 lvm   
sdk                                                                                                     8:160  0   3.7T  0 disk  
└─ceph--f4e2366c--d871--4910--a044--ed52de2a397e-osd--block--3e44b34d--3881--4e34--ad9e--5e1906617c07 253:1    0   3.7T  0 lvm   
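
The host/device mapping can often be read from the monitors as well, without logging in, because each OSD reports metadata about its backing device; a sketch, assuming the metadata was reported before the daemon died (field names can vary slightly by release):

# hostname and backing device as last reported by osd.95
ceph osd metadata 95 | grep -E '"hostname"|"devices"|"bluestore_bdev_dev_node"'
# Nautilus and later can also list tracked devices per daemon
ceph device ls-by-daemon osd.95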

2. The disk is still present, so first try whether the ceph-osd service can simply be restarted. In this case the restart brought osd.95 back up.

[root@test-3-32 ~]# systemctl | grep ceph
● ceph-osd@95.service    loaded failed failed    Ceph object storage daemon osd.95
[root@test-3-32 ~]# systemctl restart ceph-osd@95.service
[root@test-3-32 ~]# systemctl status ceph-osd@95.service
● ceph-osd@95.service - Ceph object storage daemon osd.95
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: active (running) since Mon 2022-08-08 15:04:30 CST; 10s ago
  Process: 1062974 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 1062980 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@95.service
           └─1062980 /usr/bin/ceph-osd -f --cluster ceph --id 95 --setuser ceph --setgroup ceph
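
If the unit does not come back cleanly, look at the journal and the OSD's own log before giving up on the disk; a quick sketch, assuming the default log location:

# last messages from the failed unit
journalctl -u ceph-osd@95.service -n 100 --no-pager
# recent assert/IO errors in the OSD log
tail -n 200 /var/log/ceph/ceph-osd.95.log | grep -iE 'error|assert|abort'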

3. If restarting doesn't help, or the disk's device name has drifted, remove the OSD and redeploy it.

3.1 Check the logs for disk errors

egrep -i 'medium|i\/o error|sector|Prefailure' /var/log/messages
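
The drive's SMART data is also worth a look before deciding to replace it; a sketch, assuming smartmontools is installed and the failed OSD sits on /dev/sdk as found above:

# overall health verdict from the drive itself
smartctl -H /dev/sdk
# attributes that typically precede a failure
smartctl -A /dev/sdk | grep -Ei 'Reallocated|Pending|Uncorrect'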

3.2 Kick the OSD out of the cluster

osdid=82
systemctl stop ceph-osd@$osdid.service
ceph osd out osd.$osdid
ceph osd crush rm osd.$osdid
ceph auth del osd.$osdid
ceph osd down osd.$osdid
ceph osd rm osd.$osdid    # "ceph osd rm $osdid" (bare id) works as well

or (I haven't looked closely at the difference between the two yet; I tried both and either one removes the OSD. ceph osd purge essentially rolls the crush rm, auth del and osd rm steps into a single command):

OSD=54
ceph osd ok-to-stop osd.$OSD
ceph osd safe-to-destroy osd.$OSD
ceph osd down osd.$OSD
ceph osd purge osd.$OSD --yes-i-really-mean-it
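
Either way, it is worth confirming that the OSD is really gone before touching the disk; a short check, reusing the OSD variable from above:

# the OSD should no longer appear in the tree
ceph osd tree | grep -w "osd.$OSD"
# and its auth key should be gone (this is expected to return an error now)
ceph auth get osd.$OSD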

3.3 Reformat the disk and add it back

# Sometimes the stale device-mapper mapping has to be removed first; list the mappings with: dmsetup ls
# dmsetup remove ceph--c0df59cb--3c80--4caf--8af4--dd43e0be7786-osd--block--53ade74d--be95--4997--8f24--d9cd34e6ee41
mkfs.xfs -f /dev/sdm
ceph-deploy osd create --data /dev/sdg test-3-33
# or
# ceph-deploy --overwrite-conf osd create --data /dev/sdn  dx-lt-yd-zhejiang-jinhua-5-10-104-1-130
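
Once the new OSD is created it should come up, be marked in, and start backfilling; a quick way to watch the cluster return to the full OSD count with all PGs active+clean:

# overall state and recovery progress
watch -n 5 'ceph -s'
# per-OSD utilisation, to confirm data is flowing onto the new disk
ceph osd df tree | grep test-3-33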

References

Repairing a Ceph OSD that is DOWN: https://blog.csdn.net/baidu_26495369/article/details/80325315
