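Background
The cluster was in the middle of a recovery and some PGs were running on a single replica when a disk suddenly failed in another failure domain, which left two PGs down.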
Procedure
- ceph health detail showed that osd.168 was down and attempts to restart it failed, so the OSD was marked as lost
| ceph osd lost 168 --yes-i-really-mean-it |
- The PG then went into the incomplete state. Query the PG to find out what exactly is causing it
| ceph pg 2.dcc query |
| { |
| ... |
| "state": "incomplete", |
| "epoch": 69836, |
| "up": [ |
| 7, |
| 165 |
| ], |
| "acting": [ |
| 7, |
| 165 |
| ], |
| "info": { |
| "pgid": "2.dcc", |
| ... |
| "stats": { |
| "version": "67224'8849", |
| "reported_seq": 2622, |
| "reported_epoch": 69836, |
| "state": "incomplete", |
| ... |
| "manifest_stats_invalid": false, |
| "snaptrimq_len": 0, |
| "stat_sum": { |
| ... |
| }, |
| "up": [ |
| 7, |
| 165 |
| ], |
| "acting": [ |
| 7, |
| 165 |
| ], |
| "avail_no_missing": [], |
| "object_location_counts": [], |
| "blocked_by": [ |
| 59, |
| 168 |
| ], |
| "up_primary": 7, |
| "acting_primary": 7, |
| "purged_snaps": [] |
| }, |
| "empty": 0, |
| "dne": 0, |
| "incomplete": 1, |
| "last_epoch_started": 67047, |
| "hit_set_history": { |
| "current_last_update": "0'0", |
| "history": [] |
| } |
| }, |
| ... |
| "recovery_state": [ |
| { |
| "name": "Started/Primary/Peering/Incomplete", |
| "enter_time": "2023-03-17T11:40:48.384585+0800", |
| "comment": "not enough complete instances of this PG" |
| }, |
| { |
| "name": "Started/Primary/Peering", |
| "enter_time": "2023-03-17T11:40:48.384493+0800", |
| "past_intervals": [ |
| { |
| "first": "66182", |
| "last": "67465", |
| "all_participants": [ |
| { |
| "osd": 7 |
| }, |
| { |
| "osd": 59 |
| }, |
| { |
| "osd": 74 |
| }, |
| { |
| "osd": 165 |
| }, |
| { |
| "osd": 168 |
| } |
| ], |
| "intervals": [ |
| { |
| "first": "67046", |
| "last": "67226", |
| "acting": "168" |
| }, |
| { |
| "first": "67463", |
| "last": "67465", |
| "acting": "7" |
| } |
| ] |
| } |
| ], |
| "probing_osds": [ |
| "7", |
| "74", |
| "165" |
| ], |
| "down_osds_we_would_probe": [ |
| 59, |
| 168 |
| ], |
| "peering_blocked_by": [], |
| "peering_blocked_by_detail": [ |
| { |
| "detail": "peering_blocked_by_history_les_bound" |
| } |
| ] |
| }, |
| { |
| "name": "Started", |
| "enter_time": "2023-03-17T11:40:48.384439+0800" |
| } |
| ], |
| "agent_state": {} |
| } |
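- The output shows that the PG is currently up on osd.7 and osd.165. osd.7 holds no data because its disk was just replaced, and osd.165 was only mapped to this PG after osd.168 failed.
The data can therefore only be recovered from osd.59, which is listed in blocked_by together with the lost osd.168.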
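Reading the blockers out of a long query dump by eye is error-prone, so the check can be sketched as a small filter. This is only an illustration: it assumes jq is installed, pg_blockers is a hypothetical helper name, and the field names (info.stats.blocked_by, recovery_state[].down_osds_we_would_probe) are taken from the output shown above.

```shell
#!/usr/bin/env bash
# Sketch: list the OSD ids that a PG is waiting on, from `ceph pg <pgid> query` JSON.
# Assumptions: jq is available; pg_blockers is a hypothetical helper, not a Ceph command.

pg_blockers() {
    # Reads the query JSON on stdin and prints one OSD id per line,
    # sorted and de-duplicated across both fields.
    jq -r '[ (.info.stats.blocked_by // [] | .[]),
             (.recovery_state // [] | .[] | .down_osds_we_would_probe // [] | .[]) ]
           | unique | .[]'
}

# Example (needs a live cluster):
#   ceph pg 2.dcc query | pg_blockers
```

For the output above this would print 59 and 168, which matches the conclusion that osd.59 is the only remaining place to look.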
- Locate the host of osd.59 and stop the OSD service there, then list the objects of the PG on that OSD
| systemctl stop ceph-osd@59 |
| ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-59 --pgid 2.dcc --op list --no-mon-config |
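Note:
Mount failed with '(11) Resource temporarily unavailable'
Solution: this error means that the OSD service behind the directory given in --data-path is still running while ceph-objectstore-tool tries to open it. Stop the OSD service first with systemctl stop ceph-osd@x, then run the command again. Once the list succeeds, export the PG: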
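The "Resource temporarily unavailable" failure can be caught before the tool is even started. The following is a minimal sketch, assuming the OSD runs as the systemd unit ceph-osd@<id> as in this article; require_osd_stopped is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: refuse to continue while the OSD daemon still holds its object store.
# Assumption: the OSD is managed as the systemd unit ceph-osd@<id>.
# require_osd_stopped is a hypothetical helper, not part of Ceph.

require_osd_stopped() {
    local osd_id="$1"
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    if systemctl is-active --quiet "ceph-osd@${osd_id}"; then
        echo "osd.${osd_id} is still running; stop it first: systemctl stop ceph-osd@${osd_id}" >&2
        return 1
    fi
}

# Example:
#   require_osd_stopped 59 && ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-59 --pgid 2.dcc --op list --no-mon-config
```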
| ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-59 --pgid 2.dcc --op export --file 2.dcc --no-mon-config |
- Copy the exported data of PG 2.dcc to the hosts of the currently up osd.7 and osd.165 and import it into the OSD. These two OSD services must also be stopped first
| ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --pgid 2.dcc --op import --file 2.dcc --no-mon-config |
| ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --pgid 2.dcc --op mark-complete --no-mon-config |
- Start the OSD service
| systemctl start ceph-osd@7 |
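The import side (stop, import, mark-complete, start) can be sketched as one function so that a failed step stops the sequence. This is an illustration under assumptions: import_pg, run and DRY_RUN are hypothetical names, and the data path layout /var/lib/ceph/osd/ceph-<id> follows the commands above. With DRY_RUN=1 the commands are only printed, which is a reasonable first pass on a damaged cluster.

```shell
#!/usr/bin/env bash
# Sketch: import an exported PG into one OSD and mark it complete.
# import_pg, run and DRY_RUN are hypothetical names used for illustration only.

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

import_pg() {
    local osd_id="$1" pgid="$2" in_file="$3"
    local data_path="/var/lib/ceph/osd/ceph-${osd_id}"

    # Each step runs only if the previous one succeeded; the OSD must be
    # stopped before ceph-objectstore-tool can open its object store.
    run systemctl stop "ceph-osd@${osd_id}" &&
    run ceph-objectstore-tool --data-path "$data_path" --pgid "$pgid" --op import --file "$in_file" --no-mon-config &&
    run ceph-objectstore-tool --data-path "$data_path" --pgid "$pgid" --op mark-complete --no-mon-config &&
    run systemctl start "ceph-osd@${osd_id}"
}

# Example (prints the four commands without running them):
#   DRY_RUN=1 import_pg 7 2.dcc 2.dcc
```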
If the data really cannot be found on any OSD, remove the PG from the OSD as a last resort
| ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-xx --pgid 2.xxx --op remove --force |
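- At this point, ceph -s reports unfound objects. Mark them as lost with either revert or delete
| ceph pg xxx mark_unfound_lost revert|delete |
revert: roll the object back to its previous version (data written while the PG was running on a single replica is lost)
delete: delete the object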
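Because both modes discard data, it may help to validate the mode and look at the affected objects before committing. The sketch below assumes a Ceph release that provides ceph pg <pgid> list_unfound (older releases call it list_missing); mark_unfound, run and DRY_RUN are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: list unfound objects of a PG, then mark them lost with a validated mode.
# Assumption: `ceph pg <pgid> list_unfound` exists in the installed release.
# mark_unfound, run and DRY_RUN are hypothetical names used for illustration only.

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

mark_unfound() {
    local pgid="$1" mode="$2"
    case "$mode" in
        revert|delete) ;;
        *) echo "mode must be revert or delete, got: ${mode}" >&2; return 2 ;;
    esac
    # Show what will be affected before discarding anything.
    run ceph pg "$pgid" list_unfound &&
    run ceph pg "$pgid" mark_unfound_lost "$mode"
}

# Example (prints the two commands without running them):
#   DRY_RUN=1 mark_unfound 2.dcc revert
```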