RAID Array Troubleshooting with MegaCli
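MegaCli8.07.10.tar.gz: this archive contains both the Linux and the Windows programs; download it if you need them (access password: 2344)
MegaRAID.zip download (access password: 2344)
unzip /opt/MegaRAID.zip -d /opt
ln -s /opt/MegaRAID/MegaCli/MegaCli64 /bin/MegaCli64
chmod +x /bin/MegaCli64
The commands above install MegaCli.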
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
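Show all information for every logical volume (virtual drive) on every RAID controller
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | grep -i "Virtual Disk"
Show how many RAID groups exist on the first controller (numbering starts at 0)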
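The grep above can be turned into an actual count. The sketch below is a minimal example that reads -LDInfo output from standard input; it assumes the volume header may be printed as either "Virtual Disk:" or "Virtual Drive:" depending on the MegaCli version, and the function name count_virtual_drives and the sample text are made up for illustration.

```shell
# Count logical volumes in "MegaCli64 -LDInfo -Lall -a0" output read from stdin.
# Assumption: each volume starts with a "Virtual Disk:" or "Virtual Drive:" header line.
count_virtual_drives() {
    grep -c -i -E '^Virtual (Disk|Drive):'
}

# Illustrative sample text, not captured from a real controller.
sample='Virtual Drive: 0 (Target Id: 0)
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Virtual Drive: 1 (Target Id: 1)
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0'

printf '%s\n' "$sample" | count_virtual_drives
```

On a real host the same function would be fed from the tool itself: /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | count_virtual_drives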
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -L0 -a0
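Show information for RAID group 0 on the first controller (numbering starts at 0)
/opt/MegaRAID/MegaCli/MegaCli64 -LDPdInfo -aAll
Show all information for every logical volume on every controller, including the physical drives behind each volume
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL
Show information for all controllers: how many RAID controllers the machine has and the details of each one
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
Show information for every physical drive on every controller, including model, interface, serial number, and some SMART data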
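/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r5 [32:2,32:3,32:4] WB Direct -Hsp[32:5] -a0
Create a RAID 5 array on the first controller from the drives in slots 2, 3, and 4, with the drive in slot 5 as the array's hot spare (a hot spare cannot also be a member of the array)
Note that 32:2 locates a drive under a controller by its enclosure ID (eID) and slot ID (slot numbering starts at 0)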
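The enclosure:slot pairs needed for commands like the one above can be read from -PDList output. The following is a minimal sketch, assuming the output contains "Enclosure Device ID:" and "Slot Number:" lines in that order for each drive; the function name list_drive_ids is hypothetical, and the second drive in the sample is invented for illustration.

```shell
# Print one "enclosure:slot" pair per physical drive found in -PDList output on stdin.
list_drive_ids() {
    awk -F': *' '
        /^Enclosure Device ID/ { encl = $2; sub(/[ \t\r]+$/, "", encl) }
        /^Slot Number/         { slot = $2; sub(/[ \t\r]+$/, "", slot); print encl ":" slot }
    '
}

# First drive taken from the sample output in this document; second drive is made up.
sample='Enclosure Device ID: 252
Slot Number: 4
Device Id: 49
Enclosure Device ID: 252
Slot Number: 5
Device Id: 50'

printf '%s\n' "$sample" | list_drive_ids
```

Each printed pair can be placed inside the square brackets of -PhysDrv[...] or -CfgLdAdd. On a real host: /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | list_drive_ids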
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r5 [32:2,32:3,32:4] WB Direct -a0
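Same as above: create a RAID 5 array, but without a hot spare
/opt/MegaRAID/MegaCli/MegaCli64 -LDRecon -Start -r5 -Add -PhysDrv[32:5] -L1 -a0
Add a physical drive to an existing array online
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -a0
Delete the second RAID array (-L1) on the first controller (numbering starts at 0)
/opt/MegaRAID/MegaCli/MegaCli64 -LDInit -ShowProg -LALL -aALL
After an array is created it goes through an initialization pass; this shows its progress
/opt/MegaRAID/MegaCli/MegaCli64 -LDInit -ProgDsply -LALL -aALL
Same as above, but shows the progress in a dynamically updating text interface
/opt/MegaRAID/MegaCli/MegaCli64 -LDBI -ShowProg -LALL -aALL
Show the progress of the array's background initialization
/opt/MegaRAID/MegaCli/MegaCli64 -LDBI -ProgDsply -LALL -aALL
Same as above, in a dynamically updating text interface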
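/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set [-EnclAffinity] [-nonRevertible] -PhysDrv[32:7] -a0
Make the eighth drive (slot 7) on the first controller a global hot spare (numbering starts at 0). The square brackets around -EnclAffinity and -nonRevertible mark optional flags and are not typed; the brackets in -PhysDrv[32:7] are literal.
/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set [-Dedicated [-Array1]] [-EnclAffinity] [-nonRevertible] -PhysDrv[32:7] -a0
Same as above, but as a dedicated hot spare for a specific array
/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Rmv -PhysDrv[32:7] -a0
Remove the hot spare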
/opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv [32:2] -a0
Take a physical drive offline
/opt/MegaRAID/MegaCli/MegaCli64 -PDOnline -PhysDrv [32:2] -a0
Bring a physical drive online
/opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv [32:2] -a0
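Show the rebuild progress of a physical drive after it has been replaced
/opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ProgDsply -PhysDrv [32:2] -a0
Show the rebuild progress of a physical drive in a dynamically updating interface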
/opt/MegaRAID/MegaCli/MegaCli64 -FwTermLog -Dsply -aALL
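Show the RAID controller log. The output can be large; details of operations such as array creation can be found in it
/opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL
Show the RAID controller model, the RAID configuration, and drive information
/opt/MegaRAID/MegaCli/MegaCli64 -adpCount
Show the number of adapters: only how many RAID controllers the machine has, without any details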
/opt/MegaRAID/MegaCli/MegaCli64 -AdpGetTime -aALL
Show the adapter time
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aAll
Show detailed BBU information
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep 'Charger Status'
Show the charging status of the BBU battery
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
Show BBU status information
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuCapacityInfo -aALL
Show BBU capacity information
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuDesignInfo -aALL
Show BBU design parameters
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuProperties -aALL
Show the current BBU properties
Adapter #0  # RAID controller number
Enclosure Device ID: 252  # enclosure device ID, the first half of an [E:S] drive address
Slot Number: 4  # slot number
Enclosure position: 0  # enclosure position
Device Id: 49  # device ID
Sequence Number: 2  # sequence number
Media Error Count: 0  # media error count
Other Error Count: 0  # other error count
Predictive Failure Count: 0  # predictive failure count
Last Predictive Failure Event Seq Number: 0
PD Type: SATA  # drive interface type
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]  # raw drive size
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]  # drive size before coercion
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]  # drive size after coercion, the largest usable size
Firmware state: Unconfigured(good), Spun down  # firmware state: unconfigured and healthy, spindle stopped
SAS Address(0): 0x4433221107000000
Connected Port Number: 3(path0)  # connected port number
Inquiry Data: Z1ZBBJWWST4000NM0033-9ZM170 SN06  # drive serial number, model, and firmware version
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature: N/A  # drive temperature
Pay particular attention to these fields: Media Error Count, Other Error Count, Predictive Failure Count, and Last Predictive Failure Event Seq Number
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -aALL | grep "Firmware state" | awk -F : '{print $2}' | awk -F , '{print $1}'
/opt/MegaRAID/MegaCli/MegaCli64 -pdlist -aALL | grep -E "Media Error" | awk -F : '{print $2}'
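1. Media Error
The drive has media errors, possibly bad sectors. The larger the value, the greater the risk. Depending on the drive's condition, a value above 100 is usually grounds for requesting a replacement.
2. Other Error
The drive has errors of some other kind, possibly because it is loose and needs to be reseated. Depending on the drive's condition, a value above 100 is usually grounds for requesting a replacement.
3. Predictive Failure Count
The number of predictive failure warnings for the drive. Any value above 0 is usually grounds for requesting a replacement.
4. Last Predictive Failure Event Seq Number
The event sequence number of the most recent predictive failure warning. If this value is not 0, Predictive Failure Count is not 0 either.
5. Firmware state
The current state of the drive. The nine standard states are listed below, followed by one special state: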
(1)Unconfigured Good – A drive accessible to the RAID controller but not configured as a part of
a virtual drive or as a hot spare.
(2)Online – A drive that can be accessed by the RAID controller and will be part of the virtual
drive.
(3)Rebuild – A drive to which data is being written to restore full redundancy for a virtual drive.
(4)Failed – A drive that was originally configured as Online or Hot Spare, but on which the
firmware detects an unrecoverable error.
(5)Unconfigured Bad – A drive on which the firmware detects an unrecoverable error; the drive
was Unconfigured Good or the drive could not be initialized.
(6)Missing – A drive that was Online, but which has been removed from its location.
(7)Offline – A drive that is part of a virtual drive but which has invalid data as far as the RAID
configuration is concerned.
(8)Hot Spare – A drive that is configured as a hot spare.
(9)None – A drive with an unsupported flag set. An Unconfigured Good or Offline drive that has
completed the prepare for removal operation.
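(10)Copyback: a special additional state.
Data is copied from the disk group to a drive outside the disk group; after the failed drive has been replaced, the data is copied back from that outside drive.
This happens with hot spare drives: the original hot spare only holds the data temporarily. Once the failed drive has been replaced, the data is copied from the hot spare back to the new drive, which then serves as the regular member, and the hot spare remains a hot spare permanently.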
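The thresholds described above (more than 100 media or other errors, or any predictive failure) can be checked with a short script. This is a minimal sketch rather than a definitive health check: the function name check_drive_errors and the limit variable are hypothetical, the counter values for slot 5 in the sample are invented, and it assumes "Slot Number:" is printed before the counters of each drive, as in the sample output earlier in this document.

```shell
# Read -PDList output on stdin and report drives whose counters exceed the thresholds.
check_drive_errors() {
    awk -F': *' -v limit=100 '
        /^Slot Number/              { slot = $2 + 0 }
        /^Media Error Count/        { n = $2 + 0; if (n > limit) print "slot " slot ": media errors " n }
        /^Other Error Count/        { n = $2 + 0; if (n > limit) print "slot " slot ": other errors " n }
        /^Predictive Failure Count/ { n = $2 + 0; if (n > 0)     print "slot " slot ": predictive failures " n }
    '
}

# Illustrative sample: slot 4 is clean, slot 5 carries made-up error counts.
sample='Slot Number: 4
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Slot Number: 5
Media Error Count: 132
Other Error Count: 3
Predictive Failure Count: 1'

printf '%s\n' "$sample" | check_drive_errors
```

No output means no drive crossed a threshold. On a real host: /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | check_drive_errors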
When a drive reports Firmware state: Unconfigured(bad), the following steps can be used.
First list the state of every physical drive on the RAID controller:
[nzel1:~]# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -Nolog | grep -i -E 'state|Slot\ Number'
Slot Number: 4
Firmware state: Online, Spun Up
Foreign State: None
Media Type: Solid State Device
Slot Number: 5
Firmware state: Unconfigured(bad)
Foreign State: None
Media Type: Solid State Device
Slot Number: 6
Firmware state: Online, Spun Up
Foreign State: None
Media Type: Solid State Device
Slot Number: 7
Firmware state: Online, Spun Up
Foreign State: None
Media Type: Solid State Device
Use the following command to change the drive from Unconfigured(bad) to Unconfigured(good):
/opt/MegaRAID/MegaCli/MegaCli64 -PDMakeGood -Physdrv "[:5]" -a0
Here 5 is the drive's Slot Number, and -a0 selects controller 0 on the physical node.
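The slot number to pass to -PDMakeGood can also be picked out of the filtered listing by script. The sketch below is a minimal example that reads the same grep-filtered -PDList output shown earlier; the function name find_bad_slots is hypothetical, and the sample is a shortened copy of that listing.

```shell
# Print the slot number of every drive whose firmware state is Unconfigured(bad).
find_bad_slots() {
    awk -F': *' '
        /^Slot Number/ { slot = $2; sub(/[ \t\r]+$/, "", slot) }
        /^Firmware state/ && $2 ~ /Unconfigured\(bad\)/ { print slot }
    '
}

# Shortened copy of the listing shown in this document.
sample='Slot Number: 4
Firmware state: Online, Spun Up
Slot Number: 5
Firmware state: Unconfigured(bad)
Slot Number: 6
Firmware state: Online, Spun Up'

printf '%s\n' "$sample" | find_bad_slots
```

On a real host: /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -Nolog | find_bad_slots. Review the result before running -PDMakeGood, since a drive marked bad may have a real hardware fault.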
[nzel1:~]# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -Nolog | grep -i -E 'state|Slot\ Number'
Slot Number: 4
Firmware state: Online, Spun Up
Foreign State: None
Media Type: Solid State Device
Slot Number: 5
Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign
Media Type: Solid State Device
Slot Number: 6
Firmware state: Online, Spun Up
Foreign State: None
Media Type: Solid State Device
Slot Number: 7
Firmware state: Online, Spun Up
Foreign State: None
Media Type: Solid State Device
Set the rebuild rate to 60%
[nzel1:~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp RebuildRate 60 -a0
[nzel1:~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllinfo -aALL |grep -i rebuild
Rebuild Rate : 60%
Auto Rebuild : Enabled
Rebuild Rate : Yes
Force Rebuild : Yes
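The grep above returns two "Rebuild Rate" lines: the configured percentage and a Yes/No capability line. To read back only the configured value, a sketch like the following can be used; the function name current_rebuild_rate is hypothetical, and the sample is the output shown above.

```shell
# Print the configured rebuild rate (digits only) from -AdpAllInfo output on stdin.
# Only the "Rebuild Rate" line whose value is a percentage is used; the Yes/No line is skipped.
current_rebuild_rate() {
    awk -F': *' '/^ *Rebuild Rate/ && $2 ~ /^[0-9]+%/ { sub(/%.*/, "", $2); print $2; exit }'
}

# Output copied from the listing in this document.
sample='Rebuild Rate : 60%
Auto Rebuild : Enabled
Rebuild Rate : Yes
Force Rebuild : Yes'

printf '%s\n' "$sample" | current_rebuild_rate
```

On a real host: /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllinfo -a0 | current_rebuild_rate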
Import the foreign configuration
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -Import -a0
Check the rebuild progress
[nzel1:~]# /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv [:5] -a0
Device(Encl-N/A Slot-5) is not in rebuild process
Exit Code: 0x00