Fixing an unstable Ceph cluster network (a node's Mon and OSD daemons going down on their own) after the SSH session to the node is closed

Fault description: as soon as the SSH connection to a Ceph node is closed, the mon and osd daemons on that node go down.

Key excerpts from /var/log/ceph/ceph.log:

2020-07-27 17:49:01.395696 mon.ceph-node1 (mon.0) 381808 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 5 pgs peering (PG_AVAILABILITY)
2020-07-27 17:49:03.369683 mon.ceph-node1 (mon.0) 381809 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 5 pgs peering)
2020-07-27 17:48:55.313287 mgr.ceph-node1 (mgr.6352) 266574 : cluster [DBG] pgmap v298025: 320 pgs: 22 active+undersized, 47 active+undersized+degraded, 9 peering, 242 active+clean; 53 GiB data, 759 GiB used, 11 TiB / 12 TiB avail; 0 B/s wr, 0 op/s; 2669/40779 objects degraded (6.545%); 0 B/s, 0 objects/s recovering
2020-07-27 17:48:57.314405 mgr.ceph-node1 (mgr.6352) 266575 : cluster [DBG] pgmap v298027: 320 pgs: 44 stale+active+clean, 27 active+undersized, 51 active+undersized+degraded, 20 peering, 178 active+clean; 53 GiB data, 759 GiB used, 11 TiB / 12 TiB avail; 0 B/s wr, 0 op/s; 3051/40779 objects degraded (7.482%); 0 B/s, 0 objects/s recovering


2020-07-27 17:51:02.089931 mon.ceph-node1 (mon.0) 382017 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ceph-node1,ceph-node2)


2020-07-27 17:51:02.579862 mon.ceph-node1 (mon.0) 382026 : cluster [WRN] overall HEALTH_WARN 4 osds down; 1 host (4 osds) down; Long heartbeat ping times on back interface seen, longest is 2171.403 msec; Long heartbeat ping times on front interface seen, longest is 2171.434 msec; Degraded data redundancy: 11649/40770 objects degraded (28.572%), 190 pgs degraded, 181 pgs undersized


2020-07-27 17:52:32.565545 osd.9 (osd.9) 59 : cluster [WRN] slow request osd_op(client.6400.0:370569 3.20 3:06380552:::rbd_header.172d226df4f8:head [watch unwatch cookie 140360537903920] snapc 0=[] ondisk+write+known_if_redirected e31947) initiated 2020-07-27 17:52:01.830706 currently started


2020-07-27 17:55:06.335968 mon.ceph-node1 (mon.0) 382428 : cluster [WRN] Health check failed: 2 slow ops, oldest one blocked for 31 sec, mon.ceph-node1 has slow ops (SLOW_OPS)

2020-07-27 17:56:03.133399 osd.8 (osd.8) 25 : cluster [WRN] Monitor daemon marked osd.8 down, but it is still running

[WRN]
Health check update: Long heartbeat ping times on front interface seen, longest is 21297.249 msec (OSD_SLOW_PING_TIME_FRONT)

2020-07-28 10:02:39.045969
[WRN]
Health check update: Long heartbeat ping times on back interface seen, longest is 21297.238 msec (OSD_SLOW_PING_TIME_BACK)
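
With a log this noisy, it can help to see which health check codes dominate before reading individual entries. The following is only a rough sketch: `count_health_codes` is a hypothetical helper name, and it assumes each health check line in the file ends with its code in parentheses, as in the entries above.

```shell
#!/bin/sh
# Hypothetical helper: count cluster-log lines per trailing health check
# code, e.g. "(OSD_SLOW_PING_TIME_BACK)" or "(SLOW_OPS)".
# Reads the file given as $1, or stdin when no file is given.
count_health_codes() {
    # Keep only an upper-case "(CODE)" at the very end of a line,
    # strip the parentheses, then count and sort by frequency.
    LC_ALL=C grep -o '([A-Z_][A-Z_]*)$' "${1:--}" \
        | tr -d '()' \
        | LC_ALL=C sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print $2, $1 }'
}

# Example (not executed here):
#   count_health_codes /var/log/ceph/ceph.log
```

Each output line is a code followed by the number of lines that ended with it. Entries such as `(mon.0)` are not counted because they are neither upper-case nor at the end of the line.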

On the faulty node, dmesg shows some kernel-level hardware messages (dmesg is typically used when diagnosing device faults):

[root@ceph-node3 ~]# dmesg -T | tail
[Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 09:59:55 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em2: link is not ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em3: link is not ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): em4: link is not ready
[Tue Jul 28 10:06:34 2020] IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
[Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[Tue Jul 28 10:10:29 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
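
To see at a glance which interface is flapping, the dmesg output above can be tallied per interface. This is a minimal sketch, assuming the message format shown above (`<ifname>: link is not ready` / `<ifname>: link becomes ready`); `link_events` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: count link-state messages per interface.
# Reads dmesg-style text from stdin and prints "<ifname> <state> <count>".
link_events() {
    awk '
        /: link is not ready$/ || /: link becomes ready$/ {
            # The field just before the word "link" is "<ifname>:".
            for (i = 1; i <= NF; i++) if ($i == "link") break
            ifc = $(i - 1)
            sub(/:$/, "", ifc)
            state = ($(i + 1) == "becomes") ? "ready" : "not_ready"
            n[ifc " " state]++
        }
        END { for (k in n) print k, n[k] }
    ' | LC_ALL=C sort
}

# Example (not executed here):
#   dmesg -T | link_events
```

An interface that shows repeated `not_ready` and `ready` counts, like ib0 in the output above, is the one going down and coming back up.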

Comparing this node's configuration file with the one on the other Ceph nodes showed that the parameters did not quite match:

vim /etc/sysconfig/network-scripts/ifcfg-ib0

CONNECTED_MODE=no
TYPE=InfiniBand
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ib0
UUID=2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89
DEVICE=ib0
ONBOOT=yes
IPADDR=10.0.0.20
NETMASK=255.255.255.0
#USERS=ROOT	// extra parameter that the other nodes do not have, so it was removed
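
The comparison above was done by eye. A rough sketch of automating it follows: it lists the keys that one ifcfg file sets and another does not, ignoring values (which legitimately differ per node, such as IPADDR and UUID). The function names and the example paths are hypothetical, and it assumes a copy of a healthy node's file has already been fetched locally.

```shell
#!/bin/sh
# Hypothetical helper: print the KEY names set in an ifcfg file.
# Commented-out and blank lines are ignored.
ifcfg_keys() {
    sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" \
        | LC_ALL=C sort -u
}

# Hypothetical helper: print keys present in file $1 but missing from $2.
ifcfg_extra_keys() {
    a=$(mktemp)
    b=$(mktemp)
    ifcfg_keys "$1" > "$a"
    ifcfg_keys "$2" > "$b"
    # comm -23 keeps lines that appear only in the first sorted input.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (paths are assumptions, not executed here):
#   ifcfg_extra_keys /etc/sysconfig/network-scripts/ifcfg-ib0 /tmp/ifcfg-ib0.node1
```

Run against the file as it was before the fix, this should print `USERS` if that is the only key the other nodes lack.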

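After the change I restarted the network and NetworkManager services, and the fault described above was gone. dmesg no longer shows any new error messages either. What the USERS=ROOT parameter actually does is still unclear to me. One plausible explanation, not verified here: in NetworkManager's ifcfg-rh format the USERS key restricts a connection to the listed users, so the connection may be taken down once that user no longer has a login session, which would fit the interface dropping when SSH disconnects.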

posted @ 2020-07-28 15:56  AshJo