OSD Fails to Start After Abnormal Power Loss
Problem Description
After an unexpected power loss, an OSD stays down and its daemon fails to start:
[root@node-1 ~]# ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.68958 root default
-2 2.17239     host node-4
 3 1.08620         osd.3        up  1.00000          1.00000
 4 1.08620         osd.4      down        0          1.00000
-3 2.17239     host node-2
 2 1.08620         osd.2        up  1.00000          1.00000
 6 1.08620         osd.6        up  1.00000          1.00000
-4 2.17239     host node-3
 1 1.08620         osd.1        up  1.00000          1.00000
 7 1.08620         osd.7      down        0          1.00000
-5 2.17239     host node-1
 0 1.08620         osd.0        up  1.00000          1.00000
 5 1.08620         osd.5        up  1.00000          1.00000
[root@node-1 ~]#
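On a larger cluster it can help to list only the OSDs that are marked down. A minimal sketch, assuming the column layout of the ceph osd tree output shown above:
# Print only the osd.* rows whose UP/DOWN column reads "down"
ceph osd tree | awk '$3 ~ /^osd\./ && $4 == "down"'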
Troubleshooting Steps
Log in to the node that hosts the down OSD and check the corresponding OSD service:
[root@node-3 ceph]# systemctl status ceph-osd@7.service
● ceph-osd@7.service - Ceph object storage daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
Active: failed (Result: start-limit) since Fri 2021-10-29 13:26:40 CST; 4min 11s ago
Process: 184542 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
Process: 184488 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
Main PID: 184542 (code=killed, signal=ABRT)
Oct 29 13:26:20 node-3 systemd[1]: Unit ceph-osd@7.service entered failed state.
Oct 29 13:26:20 node-3 systemd[1]: ceph-osd@7.service failed.
Oct 29 13:26:40 node-3 systemd[1]: ceph-osd@7.service holdoff time over, scheduling restart.
Oct 29 13:26:40 node-3 systemd[1]: start request repeated too quickly for ceph-osd@7.service
Oct 29 13:26:40 node-3 systemd[1]: Failed to start Ceph object storage daemon.
Oct 29 13:26:40 node-3 systemd[1]: Unit ceph-osd@7.service entered failed state.
Oct 29 13:26:40 node-3 systemd[1]: ceph-osd@7.service failed.
[root@node-3 ceph]#
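Besides systemctl status, the systemd journal for the unit usually records why the daemon aborted. For example (standard journalctl options):
# Show the most recent journal entries for the failed unit since the last boot
journalctl -u ceph-osd@7.service -b --no-pager | tail -n 50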
Restarting the OSD service does not bring it back up:
[root@node-3 ceph]# systemctl restart ceph-osd@7.service
Job for ceph-osd@7.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@7.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@7.service" followed by "systemctl start ceph-osd@7.service" again.
[root@node-3 ceph]#
# Run the commands suggested in the error message
[root@node-3 ceph]# systemctl reset-failed ceph-osd@7.service
# The restart now completes without an error, but the OSD status is still wrong; check the log next
[root@node-3 ceph]# systemctl restart ceph-osd@7.service
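The restart command returning without an error does not mean the daemon stayed up; a quick check before digging into the logs:
# Check whether the daemon is actually running and whether the OSD reports up
systemctl is-active ceph-osd@7.service
ceph osd tree | grep 'osd\.7 '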
Check the OSD log (by default /var/log/ceph/ceph-osd.7.log):
--- end dump of recent events ---
2021-10-29 13:50:07.117742 7ff76e664800 0 set uid:gid to 167:167 (ceph:ceph)
2021-10-29 13:50:07.117757 7ff76e664800 0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process ceph-osd, pid 203828
2021-10-29 13:50:07.119187 7ff76e664800 0 pidfile_write: ignore empty --pid-file
2021-10-29 13:50:07.146080 7ff76e664800 0 filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
2021-10-29 13:50:07.146462 7ff76e664800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2021-10-29 13:50:07.146467 7ff76e664800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2021-10-29 13:50:07.146483 7ff76e664800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: splice is supported
2021-10-29 13:50:07.158475 7ff76e664800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2021-10-29 13:50:07.158520 7ff76e664800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2021-10-29 13:50:07.159232 7ff76e664800 1 leveldb: Recovering log #767926
2021-10-29 13:50:07.175968 7ff76e664800 1 leveldb: Delete type=0 #767926
2021-10-29 13:50:07.176019 7ff76e664800 1 leveldb: Delete type=3 #767925
2021-10-29 13:50:07.176880 7ff76e664800 0 filestore(/var/lib/ceph/osd/ceph-7) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2021-10-29 13:50:07.196997 7ff76e664800 -1 journal Unable to read past sequence 1152396918 but header indicates the journal has committed up through 1152397338, journal is corrupt
2021-10-29 13:50:07.270433 7ff76e664800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2021-10-29 13:50:07.270609 7ff76e664800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2021-10-29 13:50:07.285389 7ff76e664800 0 osd.7 3200 crush map has features 2200130813952, adjusting msgr requires for clients
2021-10-29 13:50:07.285399 7ff76e664800 0 osd.7 3200 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2021-10-29 13:50:07.285403 7ff76e664800 0 osd.7 3200 crush map has features 2200130813952, adjusting msgr requires for osds
2021-10-29 13:50:15.583331 7fde2734f800 0 set uid:gid to 167:167 (ceph:ceph)
2021-10-29 13:50:15.583345 7fde2734f800 0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process ceph-osd, pid 204015
2021-10-29 13:50:15.584786 7fde2734f800 0 pidfile_write: ignore empty --pid-file
2021-10-29 13:50:15.616141 7fde2734f800 0 filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
2021-10-29 13:50:15.616528 7fde2734f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2021-10-29 13:50:15.616533 7fde2734f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2021-10-29 13:50:15.616550 7fde2734f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: splice is supported
2021-10-29 13:50:15.623022 7fde2734f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2021-10-29 13:50:15.623068 7fde2734f800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2021-10-29 13:50:15.623791 7fde2734f800 1 leveldb: Recovering log #767928
2021-10-29 13:50:15.623840 7fde2734f800 1 leveldb: Level-0 table #767930: started
2021-10-29 13:50:15.628917 7fde2734f800 1 leveldb: Level-0 table #767930: 139 bytes OK
2021-10-29 13:50:15.652033 7fde2734f800 1 leveldb: Delete type=0 #767928
Key log line: journal Unable to read past sequence 1152396918 but header indicates the journal has committed up through 1152397338, journal is corrupt
This typically happens after a hard power cut: the journal header claims entries were committed that can no longer be read back, so the journal tail is corrupt and the OSD aborts during replay.
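To confirm the same failure quickly on another OSD, the corruption message can be searched for directly in that OSD's log (log path as above):
# Look for the journal corruption message in the OSD log
grep -n "journal is corrupt" /var/log/ceph/ceph-osd.7.log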
Fix
On the affected OSD node, edit the Ceph configuration file as below and restart the OSD service. Setting journal_ignore_corruption tells the FileStore journal to tolerate the corrupted tail instead of aborting on it.
vim /etc/ceph/ceph.conf
[osd]
journal_ignore_corruption = true
# Comment this option out again once the OSD has restarted successfully
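Putting the whole procedure together, a minimal sketch for the affected node; the option is meant to be temporary, since ignoring corruption discards the unreadable journal tail and relies on normal recovery from the other replicas to resynchronize (verify the behaviour against your Ceph version):
# 1. Add the override under the [osd] section of /etc/ceph/ceph.conf
#      [osd]
#      journal_ignore_corruption = true
# 2. Clear the systemd failure state and start the OSD again
systemctl reset-failed ceph-osd@7.service
systemctl start ceph-osd@7.service
# 3. Confirm the daemon stays running and the OSD reports up
systemctl is-active ceph-osd@7.service
ceph osd tree | grep 'osd\.7 '
# 4. After the cluster is healthy again, comment the option out and restart the OSD once more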