Controller cache pinned for missing or offline VD

Background

An OSD in the Ceph cluster suddenly went down, and storcli showed the underlying disk as offline.

Troubleshooting steps

  • Check the state of every physical drive behind the RAID controller
$ sudo storcli64 /c0/eall/sall show
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Failure
Description = Show Drive Information Failed.
Detailed Status :
===============
--------------------------------
Drive Status ErrCd ErrMsg
--------------------------------
/c0/e0/s1 Success 0 -
/c0/e0/s2 Success 0 -
/c0/e0/s4 Success 0 -
/c0/e0/s5 Success 0 -
/c0/e0/s7 Success 0 -
/c0/e0/s8 Success 0 -
/c0/e0/s10 Success 0 -
/c0/e0/s11 Failure 46 -
/c0/e0/s14 Success 0 -
/c0/e0/s15 Success 0 -
--------------------------------
Drive Information :
=================
----------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
----------------------------------------------------------------------------------
0:1 7 Onln 1 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:2 14 Onln 2 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:4 13 Onln 3 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:5 4 Onln 4 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:7 3 Onln 5 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:8 8 Onln 6 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:10 9 Onln 7 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:11 11 Failed 8 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM U -
0:14 1 Onln 0 138.766 GB SATA SSD N N 512B INTEL SSDSC2BB150G7 U -
0:15 2 Onln 0 138.766 GB SATA SSD N N 512B INTEL SSDSC2BB150G7 U -
----------------------------------------------------------------------------------
EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild

/c0/e0/s11 returns error code 46, and the drive in slot 0:11 is in the Failed state.
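With many slots, the one bad row is easy to miss in the table. As a minimal sketch, the helper below filters the text output of `storcli64 /c0/eall/sall show` and keeps only the drives whose State column is not `Onln`. The function name `unhealthy_drives` is hypothetical, and the parsing assumes the row layout shown above (first field `EID:Slt`, third field State). Hot spares and unconfigured drives would also be listed, since they are not `Onln` either.

```shell
# Hypothetical helper: reads `storcli64 /cX/eall/sall show` output on stdin
# and prints "EID:Slt State" for every drive row whose State is not Onln.
# Assumes the row layout shown in this post (field 1 = EID:Slt, field 3 = State).
unhealthy_drives() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ && $3 != "Onln" { print $1, $3 }'
}
```

Possible usage: `sudo storcli64 /c0/eall/sall show | unhealthy_drives`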

  • Check the state of the RAID virtual drives (VDs)
$ sudo storcli64 /c0/vall show
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = None
Virtual Drives :
==============
---------------------------------------------------------------
DG/VD TYPE State Access Consist Cache Cac sCC Size Name
---------------------------------------------------------------
0/0 RAID1 Optl RW Yes NRWTD - ON 138.766 GB
1/1 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
2/2 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
3/3 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
4/4 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
5/5 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
6/6 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
7/7 RAID0 Optl RW Yes NRWTD - ON 7.276 TB
8/8 RAID0 OfLn RW No NRWTD - ON 7.276 TB
---------------------------------------------------------------
VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

At this point VD 8 is already offline (State = OfLn).
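The same check can be scripted for virtual drives. The sketch below keeps only the rows of `storcli64 /c0/vall show` whose State is not `Optl`. The name `degraded_vds` is hypothetical, and the parsing assumes the column order shown above (field 1 = DG/VD, field 2 = TYPE, field 3 = State).

```shell
# Hypothetical helper: reads `storcli64 /cX/vall show` output on stdin
# and prints "DG/VD TYPE State" for every VD that is not Optimal.
# Assumes the row layout shown in this post.
degraded_vds() {
    awk '$1 ~ /^[0-9]+\/[0-9]+$/ && $3 != "Optl" { print $1, $2, $3 }'
}
```

Possible usage: `sudo storcli64 /c0/vall show | degraded_vds`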

  • Check the RAID controller events
$ sudo storcli64 /c0 show events filter=fatal
seqNum: 0x0000510a
Time: Sun Feb 12 18:11:01 2023
Code: 0x00000143
Class: 3
Locale: 0x21
Event Description: Controller cache pinned for missing or offline VD 08/8
Event Data:
===========
Target Id: 8
seqNum: 0x0000510b
Time: Sun Feb 12 18:11:01 2023
Code: 0x000000fc
Class: 3
Locale: 0x01
Event Description: VD 08/8 is now OFFLINE
Event Data:
===========
Target Id: 8
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = None
Events = GETEVENTS
Controller Properties :
=====================
------------------------------------
Ctrl Status Method Value
------------------------------------
0 Success handleSuboption Events
------------------------------------

Trigger of the failure: Controller cache pinned for missing or offline VD
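The event dump is verbose, so a one-line-per-event view can make the trigger easier to spot. This is a rough sketch: `summarize_events` is a hypothetical name, and it assumes each event carries a `Time:` line followed later by an `Event Description:` line, as in the excerpt above.

```shell
# Hypothetical helper: reads `storcli64 /cX show events ...` output on stdin
# and prints "<time> | <description>" for each event.
# Assumes "Time: ..." precedes "Event Description: ..." within every event.
summarize_events() {
    awk '
        /^Time: /              { t = $0; sub(/^Time: /, "", t) }
        /^Event Description: / { d = $0; sub(/^Event Description: /, "", d); print t " | " d }
    '
}
```

Possible usage: `sudo storcli64 /c0 show events filter=fatal | summarize_events`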

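Cause

The disk lost its connection for an unknown reason while the controller cache still held data that had not yet been flushed to it. The controller pins that data as preserved cache for the now-offline VD.

Solution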

$ sudo storcli64 /c0 show preservedcache
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = None
--------------------
VD Size State
--------------------
8 7.276 TB Offline
--------------------
$ sudo storcli64 /c0/v8 delete preservedcache
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = Virtual Drive preserved Cache Data Cleared.
$ sudo storcli64 /c0/e0/s11 set online
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
$ sudo storcli64 /c0/eall/sall show
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
---------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
---------------------------------------------------------------------------------
0:1 7 Onln 1 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:2 14 Onln 2 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:4 13 Onln 3 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:5 4 Onln 4 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:7 3 Onln 5 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:8 8 Onln 6 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:10 9 Onln 7 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:11 11 Onln 8 7.276 TB SATA HDD N N 512B ST8000NM0055-1RM112 U -
0:14 1 Onln 0 138.766 GB SATA SSD N N 512B INTEL SSDSC2BB150G7 U -
0:15 2 Onln 0 138.766 GB SATA SSD N N 512B INTEL SSDSC2BB150G7 U -
---------------------------------------------------------------------------------
EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild
$ sudo systemctl reset-failed ceph-osd@71
$ sudo systemctl restart ceph-osd@71
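The sequence above can be captured as a dry-run helper that only prints the commands for review, rather than running them. This is a sketch under assumptions: `print_recovery_plan` is a hypothetical name, and the five arguments (controller index, VD number, enclosure ID, slot number, OSD id) must be supplied by the operator, since how the VD maps to an OSD is not derived here. Note that `delete preservedcache` discards the writes that were never flushed to the disk, so it is worth reviewing the plan before executing it.

```shell
# Hypothetical dry-run helper: prints, but does not execute, the recovery
# commands used in this post for one pinned VD.
# Arguments: <controller> <vd> <enclosure> <slot> <osd-id>
print_recovery_plan() {
    if [ "$#" -ne 5 ]; then
        echo "usage: print_recovery_plan <controller> <vd> <enclosure> <slot> <osd-id>" >&2
        return 1
    fi
    cat <<EOF
sudo storcli64 /c$1 show preservedcache
sudo storcli64 /c$1/v$2 delete preservedcache
sudo storcli64 /c$1/e$3/s$4 set online
sudo storcli64 /c$1/eall/sall show
sudo systemctl reset-failed ceph-osd@$5
sudo systemctl restart ceph-osd@$5
EOF
}
```

For the incident described here, the call would be `print_recovery_plan 0 8 0 11 71`.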

dell
