Handling PG inconsistency errors in a Ceph cluster
The cluster reports a scrub error; `ceph health detail` shows:
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 1.7fff is active+clean+scrubbing+deep+inconsistent+repair, acting [184,229]
Error summary
- Problem PG: 1.7fff
- OSD IDs: 184, 229
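The PG id and acting OSD set can be parsed straight out of the health output; a minimal sketch using the `pg ... inconsistent` line above (on a live cluster, feed it `ceph health detail | grep inconsistent` instead of the sample variable):

```shell
# Parse the inconsistent PG id and its acting OSD set from a
# `ceph health detail` line. Sample line taken from this incident.
line='pg 1.7fff is active+clean+scrubbing+deep+inconsistent+repair, acting [184,229]'

pg_id=$(printf '%s\n' "$line" | sed -n 's/^pg \([^ ]*\) is .*/\1/p')
osds=$(printf '%s\n' "$line" | sed -n 's/.*acting \[\(.*\)\].*/\1/p')

echo "problem PG: $pg_id"
echo "acting OSDs: $osds"
```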
Repair steps
- Run the standard repair:
ceph pg repair 1.7fff
- Check the repair result:
ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 1.7fff is active+clean+scrubbing+deep+inconsistent+repair, acting [184,229]
The error is still present.
- Watch the cluster activity:
ceph -w
2020-09-05 09:13:25.818257 osd.184 [ERR] 1.7fff repair : stat mismatch, got 9855/9856 objects, 0/0 clones, 9855/9856 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 41285080957/41289275261 bytes, 0/0 hit_set_archive bytes.
2020-09-05 09:13:25.818757 osd.184 [ERR] 1.7fff repair 1 errors, 1 fixed
2020-09-05 09:13:31.318617 mon.cb-mon-38 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
2020-09-05 09:13:31.321338 mon.cb-mon-38 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
2020-09-05 09:13:31.321983 mon.cb-mon-38 [INF] Cluster is now healthy
2020-09-05 10:00:00.001158 mon.cb-mon-38 [INF] overall HEALTH_OK
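Whether the repair actually completed can be checked mechanically from the log: the osd line reports how many errors were found and how many were fixed. A minimal sketch that parses those counts out of a saved log line (sample text is the osd.184 line above; in practice grep the real cluster log):

```shell
# Parse the "N errors, M fixed" counts from a repair log line.
# Sample line taken from the `ceph -w` output above.
log='2020-09-05 09:13:25.818757 osd.184 [ERR] 1.7fff repair 1 errors, 1 fixed'

errors=$(printf '%s\n' "$log" | sed -n 's/.*repair \([0-9][0-9]*\) errors.*/\1/p')
fixed=$(printf '%s\n' "$log" | sed -n 's/.*errors, \([0-9][0-9]*\) fixed.*/\1/p')

echo "errors=$errors fixed=$fixed"
# The repair fully succeeded when every detected error was fixed.
[ "$errors" = "$fixed" ] && echo "repair complete"
```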
Other repair methods
1. Scrub and then repair the PG:
ceph pg scrub 1.7fff
ceph pg deep-scrub 1.7fff
ceph pg repair 1.7fff
2. Repair the associated OSDs:
ceph osd repair 184
ceph osd repair 229
3. Stop the PG's primary OSD
- Find the PG's primary OSD:
root@manager1:~# ceph pg 1.7fff query|grep primary
"same_primary_since": 1070,
"num_objects_missing_on_primary": 0,
"up_primary": 184,
"acting_primary": 184
"same_primary_since": 0,
"num_objects_missing_on_primary": 0,
"up_primary": -1,
"acting_primary": -1
- Find the host the OSD is on:
root@manager1:~# ceph osd tree|grep -B25 184
-41        218.29431     host cc-d-19
 19   hdd    9.09560         osd.19      up  1.00000 1.00000
 39   hdd    9.09560         osd.39      up  1.00000 1.00000
 52   hdd    9.09560         osd.52      up  1.00000 1.00000
 70   hdd    9.09560         osd.70      up  1.00000 1.00000
 87   hdd    9.09560         osd.87      up  1.00000 1.00000
106   hdd    9.09560         osd.106     up  1.00000 1.00000
130   hdd    9.09560         osd.130     up  1.00000 1.00000
151   hdd    9.09560         osd.151     up  1.00000 1.00000
164   hdd    9.09560         osd.164     up  1.00000 1.00000
184   hdd    9.09560         osd.184     up  1.00000 1.00000
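Mapping an OSD id back to its host can be scripted with awk over the `ceph osd tree` output: remember the most recent `host` bucket line, then print it when the target OSD appears. A minimal sketch using two sample lines from the tree above:

```shell
# Map an OSD to its host bucket in `ceph osd tree` output.
# Sample lines from the tree above; on a live cluster pipe
# `ceph osd tree` directly into awk instead.
tree='-41        218.29431     host cc-d-19
184   hdd    9.09560          osd.184     up  1.00000 1.00000'

host=$(printf '%s\n' "$tree" | awk '
  $3 == "host"    { h = $4 }     # remember current host bucket
  $4 == "osd.184" { print h }    # emit it for the OSD we care about
')
echo "osd.184 is on host $host"
```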
- Stop the corresponding OSD service (data recovery will be slow and will also slow down the cluster):
systemctl stop ceph-osd@184
- Once recovery completes, run the repair again:
ceph pg repair 1.7fff