Handling procedure for PGs in the down state

Background

Data was being recovered and some PGs were running on a single replica when a disk suddenly failed in another failure domain, leaving two PGs down.

Procedure

  • ceph health detail shows that osd.168 is down, and restarting it fails, so mark it as lost:
ceph osd lost 168 --yes-i-really-mean-it
  • The PG now enters the incomplete state. Query it to find out what exactly is causing this:
ceph pg 2.dcc query
{
...
"state": "incomplete",
"epoch": 69836,
"up": [
7,
165
],
"acting": [
7,
165
],
"info": {
"pgid": "2.dcc",
...
"stats": {
"version": "67224'8849",
"reported_seq": 2622,
"reported_epoch": 69836,
"state": "incomplete",
...
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
...
},
"up": [
7,
165
],
"acting": [
7,
165
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [
59,
168
],
"up_primary": 7,
"acting_primary": 7,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 1,
"last_epoch_started": 67047,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
...
"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2023-03-17T11:40:48.384585+0800",
"comment": "not enough complete instances of this PG"
},
{
"name": "Started/Primary/Peering",
"enter_time": "2023-03-17T11:40:48.384493+0800",
"past_intervals": [
{
"first": "66182",
"last": "67465",
"all_participants": [
{
"osd": 7
},
{
"osd": 59
},
{
"osd": 74
},
{
"osd": 165
},
{
"osd": 168
}
],
"intervals": [
{
"first": "67046",
"last": "67226",
"acting": "168"
},
{
"first": "67463",
"last": "67465",
"acting": "7"
}
]
}
],
"probing_osds": [
"7",
"74",
"165"
],
"down_osds_we_would_probe": [
59,
168
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]
},
{
"name": "Started",
"enter_time": "2023-03-17T11:40:48.384439+0800"
}
],
"agent_state": {}
}
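  • The output shows that the current up set is osd.7 and osd.165. osd.7 holds no data because its disk was just replaced, and osd.165 was only remapped in after osd.168 failed.
    Of the two OSDs listed in blocked_by, osd.168 is lost, so the data can only be recovered from osd.59.
  • Find the host of osd.59 and stop the OSD service: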
systemctl stop ceph-osd@59
  • Check whether that OSD holds the data of pg 2.dcc:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-59 --pgid 2.dcc --op list --no-mon-config

Note:

The tool may fail with: Mount failed with '(11) Resource temporarily unavailable'

Fix: this error means that the OSD service owning the directory given in --data-path is still running. Stop it first with systemctl stop ceph-osd@x, then run the command again.
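
When deciding which OSDs are worth inspecting with ceph-objectstore-tool, the relevant fields of the ceph pg 2.dcc query output can also be pulled out with a short script instead of being read by eye. The following is a minimal Python sketch and not part of the original procedure: summarize_pg_query and SAMPLE are hypothetical names, and SAMPLE is a trimmed copy of the query output shown earlier rather than live cluster output.

```python
import json

# Trimmed copy of the query output from this article. On a real cluster the
# text printed by `ceph pg 2.dcc query` would be passed to json.loads instead.
SAMPLE = json.loads("""
{
  "state": "incomplete",
  "up": [7, 165],
  "acting": [7, 165],
  "info": {"stats": {"blocked_by": [59, 168]}},
  "recovery_state": [
    {"name": "Started/Primary/Peering/Incomplete",
     "comment": "not enough complete instances of this PG"},
    {"name": "Started/Primary/Peering",
     "probing_osds": ["7", "74", "165"],
     "down_osds_we_would_probe": [59, 168]}
  ]
}
""")


def summarize_pg_query(query: dict) -> dict:
    """Collect the peering-related fields of a pg query result (hypothetical helper)."""
    stats = query.get("info", {}).get("stats", {})
    summary = {
        "state": query.get("state"),
        "up": query.get("up", []),
        "acting": query.get("acting", []),
        "blocked_by": stats.get("blocked_by", []),
        "probing_osds": [],
        "down_osds_we_would_probe": [],
        "comments": [],
    }
    for entry in query.get("recovery_state", []):
        if "comment" in entry:
            summary["comments"].append(entry["comment"])
        # OSD ids appear as strings in some fields and as ints in others.
        summary["probing_osds"] += [int(o) for o in entry.get("probing_osds", [])]
        summary["down_osds_we_would_probe"] += [
            int(o) for o in entry.get("down_osds_we_would_probe", [])
        ]
    return summary


summary = summarize_pg_query(SAMPLE)
print("state:", summary["state"])
print("blocked by:", summary["blocked_by"])
print("down OSDs that peering wants to probe:", summary["down_osds_we_would_probe"])
```

The OSDs in blocked_by that are not already marked lost are the ones worth checking for a surviving copy of the PG, which in this incident leaves osd.59.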

  • Export the data of pg 2.dcc:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-59 --pgid 2.dcc --op export --file 2.dcc --no-mon-config
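  • Copy the exported pg 2.dcc file to the hosts of osd.7 and osd.165, which form the current up set, and import it into the OSD. The services of these two OSDs must be stopped first as well: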
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --pgid 2.dcc --op import --file 2.dcc --no-mon-config
  • Mark pg 2.dcc as complete:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --pgid 2.dcc --op mark-complete --no-mon-config
  • Start the OSD service:
systemctl start ceph-osd@7
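
The steps above can be written down as an ordered command plan before anything is run, which makes it easier to review the OSD ids and paths. This is a hedged Python sketch that only prints the commands: objectstore_cmd and recovery_plan are hypothetical helpers, the source OSD is assumed to be osd.59 and the target osd.7 as in the narrative, and copying the export file between hosts is left out because the article does not say how it was done.

```python
import shlex
from typing import List, Optional


def objectstore_cmd(osd_id: int, pgid: str, op: str,
                    file: Optional[str] = None) -> List[str]:
    """Build a ceph-objectstore-tool command line as an argv list (hypothetical helper)."""
    cmd = [
        "ceph-objectstore-tool",
        "--data-path", f"/var/lib/ceph/osd/ceph-{osd_id}",
        "--pgid", pgid,
        "--op", op,
    ]
    if file is not None:
        cmd += ["--file", file]
    cmd.append("--no-mon-config")
    return cmd


def recovery_plan(source_osd: int, target_osd: int, pgid: str,
                  dump_file: str) -> List[List[str]]:
    """Ordered commands for the export/import procedure.

    The source commands run on the source OSD's host and the target commands
    on the target OSD's host; the file copy in between is not modelled here.
    """
    return [
        ["systemctl", "stop", f"ceph-osd@{source_osd}"],
        objectstore_cmd(source_osd, pgid, "list"),
        objectstore_cmd(source_osd, pgid, "export", dump_file),
        ["systemctl", "stop", f"ceph-osd@{target_osd}"],
        objectstore_cmd(target_osd, pgid, "import", dump_file),
        objectstore_cmd(target_osd, pgid, "mark-complete"),
        ["systemctl", "start", f"ceph-osd@{target_osd}"],
    ]


# Dry run: print the plan for review instead of executing it.
for cmd in recovery_plan(59, 7, "2.dcc", "2.dcc"):
    print(shlex.join(cmd))
```

Keeping each command as an argv list means the same plan could later be passed to subprocess.run one step at a time, stopping at the first failure, but printing it first is the safer default for a destructive tool.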

If the data cannot be found anywhere: remove the PG

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-xx --pgid 2.xxx --op remove --force
  • At this point ceph -s reports unfound objects. Mark them as lost, choosing either revert or delete:
ceph pg xxx mark_unfound_lost revert|delete

revert: roll the object back to its previous version (data written while the PG was running on a single replica is lost)

delete: delete the object
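
Because both modes discard data, it can be worth guarding the command against typos before it reaches the cluster. The sketch below is an illustration only: mark_unfound_lost_cmd is a hypothetical helper, the pgid 2.dcc is used purely as an example value, and the listing step assumes a Ceph release that provides ceph pg <pgid> list_unfound.

```python
import shlex
from typing import List

# The two modes described above.
VALID_MODES = ("revert", "delete")


def mark_unfound_lost_cmd(pgid: str, mode: str) -> List[str]:
    """Build the mark_unfound_lost command, rejecting unknown modes (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return ["ceph", "pg", pgid, "mark_unfound_lost", mode]


# Review what is unfound first, then print the command that would be run.
print(shlex.join(["ceph", "pg", "2.dcc", "list_unfound"]))
print(shlex.join(mark_unfound_lost_cmd("2.dcc", "revert")))
```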

posted @ ishmaelwanglin