Inspur hardware monitoring (ipmitool, MegaCli)
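Why monitor with ipmitool and MegaCli?
On Inspur servers, the information reported by ipmitool corresponds one-to-one with what the management card shows. The fan status values on the management card are one example.
The same fan information can be read from inside the operating system with ipmitool:
ipmitool sdr list | grep 'FAN[0-6]'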
FAN0_F_Speed | 4224 RPM | ok
FAN0_R_Speed | 3744 RPM | ok
FAN1_F_Speed | 4224 RPM | ok
FAN1_R_Speed | 3744 RPM | ok
FAN2_F_Speed | 4320 RPM | ok
FAN2_R_Speed | 3840 RPM | ok
FAN3_F_Speed | 4224 RPM | ok
FAN3_R_Speed | 3840 RPM | ok
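The output shows each fan's reading and its status.
Fan monitoring differs between Dell and Inspur. On Dell, an abnormal fan reading is reflected in the returned status. On Inspur, the monitoring only checks whether the fan is present. Suppose a fan reads only 200 RPM but is still present: Dell reports an abnormal status and raises an alert, while Inspur does not. (This rarely happens in practice; I have never seen it myself.)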
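Since the status column alone does not catch a fan that is present but spinning too slowly, the RPM reading can be checked directly. The following is a minimal sketch, not a definitive implementation: the function name `low_rpm_fans`, the `MIN_RPM` threshold of 1000, and the 200 RPM sample line are all assumptions made for illustration and should be tuned per server model.

```python
import re

# Hypothetical threshold; pick a value that suits the server model.
MIN_RPM = 1000

# First two lines are from the output above; the 200 RPM line is made up
# to illustrate a fan that is present but abnormally slow.
SAMPLE = """\
FAN0_F_Speed | 4224 RPM | ok
FAN0_R_Speed | 3744 RPM | ok
FAN1_F_Speed | 200 RPM | ok
"""


def low_rpm_fans(sdr_list_text, min_rpm=MIN_RPM):
    """Return (name, rpm) pairs for fans reading below min_rpm.

    Expects the three-column `ipmitool sdr list` format:
    name | reading | status
    """
    slow = []
    for line in sdr_list_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3:
            continue  # not a sensor line
        name, reading, _status = fields
        match = re.fullmatch(r"(\d+(?:\.\d+)?) RPM", reading)
        if name.upper().startswith("FAN") and match:
            rpm = float(match.group(1))
            if rpm < min_rpm:
                slow.append((name, rpm))
    return slow
```

In a real check, the text passed in would be the captured output of `ipmitool sdr list`, and a non-empty result would trigger the alert.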
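Monitoring with ipmitool
- Processor status monitoring
Here the value to check is not the ok column but the last column, which should read Presence detected.
ipmitool sdr elist | grep -i 'cpu[0-2]_status'
The output matches what the management card shows: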
CPU0_Status | 7Dh | ok | 3.0 | Presence detected
CPU1_Status | 7Eh | ok | 3.0 | Presence detected
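The check described here (ignore the ok column, look at the last column) can be sketched as a small parser for the five-column `sdr elist` format. This is only an illustration: the `Sensor` type and the functions `parse_elist` and `missing_cpus` are names I made up, and the sensor naming is assumed to follow the `CPUn_Status` pattern seen in this output.

```python
from typing import NamedTuple


class Sensor(NamedTuple):
    name: str       # e.g. CPU0_Status
    sensor_id: str  # e.g. 7Dh
    status: str     # the "ok" column
    entity: str     # e.g. 3.0
    state: str      # e.g. Presence detected (may be empty)


# Copied from the output above.
SAMPLE = """\
CPU0_Status | 7Dh | ok | 3.0 | Presence detected
CPU1_Status | 7Eh | ok | 3.0 | Presence detected
"""


def parse_elist(text):
    """Split each `ipmitool sdr elist` line into its five columns."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 5:
            sensors.append(Sensor(*fields))
    return sensors


def missing_cpus(text):
    """Return CPU status sensors whose last column is not 'Presence detected'."""
    missing = []
    for sensor in parse_elist(text):
        name = sensor.name.upper()
        if name.startswith("CPU") and name.endswith("_STATUS"):
            if "presence detected" not in sensor.state.lower():
                missing.append(sensor.name)
    return missing
```

The comparison is case-insensitive because the same state is spelled `Presence detected` for some sensors and `Presence Detected` for others.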
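- Checking memory status
Note: Inspur is rather annoying here. Its memory sensors are named after the CPU (ugh), so check the actual sensor names before writing the grep. Sensor names also differ from one server model to another, so the monitoring has to accommodate each variant.
As with the processors, the check is whether the state value is Presence Detected.
ipmitool sdr elist | grep -i 'CPU[0-1]_C[0-1]D[0-1]'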
CPU0_C0D0 | 83h | ok | 32.0 | Presence Detected
CPU0_C0D1 | 84h | ok | 32.1 |
CPU0_C1D0 | 85h | ok | 32.2 | Presence Detected
CPU0_C1D1 | 86h | ok | 32.3 |
CPU1_C0D0 | 8Fh | ok | 32.12 | Presence Detected
CPU1_C0D1 | 90h | ok | 32.13 |
CPU1_C1D0 | 91h | ok | 32.14 | Presence Detected
CPU1_C1D1 | 92h | ok | 32.15 |
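Because an empty slot and a DIMM that has dropped out both show an empty state column, one possible approach is to compare the populated slots against a known baseline, with the name pattern kept configurable to cope with per-model naming. The sketch below is an assumption on my part, not something the tools provide: `populated_dimms`, `missing_dimms`, and the `expected` baseline list are hypothetical names.

```python
import re

# Naming pattern seen in the output above; other models may need a
# different pattern passed in via name_pattern.
DEFAULT_PATTERN = r"CPU\d+_C\d+D\d+"

# Copied from the output above (CPU0 slots only).
SAMPLE = """\
CPU0_C0D0 | 83h | ok | 32.0 | Presence Detected
CPU0_C0D1 | 84h | ok | 32.1 |
CPU0_C1D0 | 85h | ok | 32.2 | Presence Detected
CPU0_C1D1 | 86h | ok | 32.3 |
"""


def populated_dimms(elist_text, name_pattern=DEFAULT_PATTERN):
    """Return the names of DIMM sensors reporting 'Presence Detected'."""
    regex = re.compile(name_pattern, re.IGNORECASE)
    populated = []
    for line in elist_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5 or not regex.fullmatch(fields[0]):
            continue
        if fields[4].lower() == "presence detected":
            populated.append(fields[0])
    return populated


def missing_dimms(elist_text, expected, name_pattern=DEFAULT_PATTERN):
    """Return expected slots that are not currently reported as present."""
    present = set(populated_dimms(elist_text, name_pattern))
    return sorted(set(expected) - present)
```

The baseline would typically be recorded once per host when the machine is known to be healthy.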
- Checking disk slot status
ipmitool sdr elist | grep -i disk
Disk slots:
DISK0_Status | B4h | ok | 4.0 | Drive Present
DISK1_Status | B5h | ok | 4.1 | Drive Present
DISK2_Status | B6h | ok | 4.2 | Drive Present
DISK3_Status | B7h | ok | 4.3 | Drive Present
DISK4_Status | B8h | ok | 4.4 | Drive Present
DISK5_Status | B9h | ok | 4.5 | Drive Present
DISK6_Status | BAh | ok | 4.6 | Drive Present
DISK7_Status | BBh | ok | 4.7 | Drive Present
DISK8_Status | BCh | ok | 4.8 | Drive Present
DISK9_Status | BDh | ok | 4.9 | Drive Present
DISK10_Status | BEh | ok | 4.10 | Drive Present
DISK11_Status | BFh | ok | 4.11 | Drive Present
DISK12_Status | C0h | ok | 4.12 |
DISK13_Status | C1h | ok | 4.13 |
DISK14_Status | C2h | ok | 4.14 |
DISK15_Status | C3h | ok | 4.15 |
DISK16_Status | C4h | ok | 4.16 |
DISK17_Status | C5h | ok | 4.17 |
DISK18_Status | C6h | ok | 4.18 |
DISK19_Status | C7h | ok | 4.19 |
DISK20_Status | C8h | ok | 4.20 |
DISK21_Status | C9h | ok | 4.21 |
DISK22_Status | CAh | ok | 4.22 |
DISK23_Status | CBh | ok | 4.23 |
DISK24_Status | D4h | ok | 4.24 |
Disk backplane slots:
DISK0_R_Status | CCh | ok | 4.0 | Drive Present
DISK1_R_Status | CDh | ok | 4.1 | Drive Present
DISK2_R_Status | CEh | ok | 4.2 |
DISK3_R_Status | CFh | ok | 4.3 |
DISK4_R_Status | D0h | ok | 4.4 |
DISK5_R_Status | D1h | ok | 4.5 |
DISK6_R_Status | D2h | ok | 4.6 |
DISK7_R_Status | D3h | ok | 4.7 |
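To turn the slot listing into something a monitoring script can compare between runs, the occupied slot numbers can be collected per sensor group. This is a rough sketch: the function name `present_disks` and the group keys `disk` and `disk_r` are my own labels for the `DISKn_Status` and `DISKn_R_Status` sensors shown above.

```python
import re

# A few lines copied from the output above.
SAMPLE = """\
DISK0_Status | B4h | ok | 4.0 | Drive Present
DISK12_Status | C0h | ok | 4.12 |
DISK0_R_Status | CCh | ok | 4.0 | Drive Present
DISK2_R_Status | CEh | ok | 4.2 |
"""


def present_disks(elist_text):
    """Return slot numbers reporting 'Drive Present', grouped by sensor family.

    'disk'   -> DISKn_Status sensors
    'disk_r' -> DISKn_R_Status sensors
    """
    result = {"disk": [], "disk_r": []}
    for line in elist_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        match = re.fullmatch(r"DISK(\d+)(_R)?_Status", fields[0], re.IGNORECASE)
        if match and fields[4].lower() == "drive present":
            group = "disk_r" if match.group(2) else "disk"
            result[group].append(int(match.group(1)))
    return result
```

Comparing the result with the previous run shows which slot lost its drive, which an empty state column alone cannot tell you.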
- Power supply information
ipmitool sdr elist | grep -i 'psu[0-1]_status'
PSU0_Status | 74h | ok | 10.0 | Presence detected
PSU1_Status | 75h | ok | 10.0 | Presence detected
- Fan presence status
ipmitool sdr elist | grep -i 'fan[0-9]_Present'
FAN0_Present | 60h | ok | 29.0 | Device Present
FAN1_Present | 61h | ok | 29.1 | Device Present
FAN2_Present | 62h | ok | 29.2 | Device Present
FAN3_Present | 63h | ok | 29.3 | Device Present
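The power supply and fan checks follow the same shape: a sensor name pattern and the state text it should report. One way to express that is a small rule table. The sketch below is illustrative only; `RULES` and `failed_presence_checks` are hypothetical names, and the PSU1 line with an empty state is made up to show a failure.

```python
import re

# (sensor name pattern, expected state in lower case), taken from the
# outputs above.
RULES = [
    (r"PSU\d+_Status", "presence detected"),
    (r"FAN\d+_Present", "device present"),
]

# PSU0 and FAN0 lines are from the output above; the PSU1 line with an
# empty state is invented for illustration.
SAMPLE = """\
PSU0_Status | 74h | ok | 10.0 | Presence detected
PSU1_Status | 75h | ok | 10.0 |
FAN0_Present | 60h | ok | 29.0 | Device Present
"""


def failed_presence_checks(elist_text, rules=RULES):
    """Return (name, actual_state) for sensors that break a rule."""
    failures = []
    for line in elist_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        name, state = fields[0], fields[4]
        for pattern, expected in rules:
            if re.fullmatch(pattern, name, re.IGNORECASE) and state.lower() != expected:
                failures.append((name, state))
    return failures
```

Note that this only evaluates sensors that appear in the output; a sensor that disappears from the listing entirely would need a separate count check.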
- Temperature monitoring
ipmitool sdr elist | grep -i temp
Inlet_Temp | 00h | ok | 12.0 | 22 degrees C
Outlet_Temp | 01h | ok | 55.1 | 32 degrees C
CPU0_Temp | 06h | ok | 3.0 | 28 degrees C
CPU1_Temp | 07h | ok | 3.0 | 26 degrees C
CPU0_DIMM_Temp | 0Eh | ok | 32.0 | 34 degrees C
CPU1_DIMM_Temp | 0Fh | ok | 32.0 | 32 degrees C
CPU0_VR_Temp | 02h | ok | 3.0 | 31 degrees C
CPU1_VR_Temp | 03h | ok | 3.1 | 30 degrees C
PCH_Temp | 16h | ok | 3.0 | 44 degrees C
OCP_Temp | 29h | ns | 11.0 | No Reading
NVME_Temp | 28h | ns | 11.1 | No Reading
PSU0_Temp | 1Ch | ok | 32.0 | 28 degrees C
PSU1_Temp | 1Dh | ok | 32.0 | 27 degrees C
RAID0_Temp | 17h | ok | 11.0 | 58 degrees C
RAID1_Temp | 18h | ns | 11.1 | No Reading
RAID2_Temp | 19h | ns | 11.2 | No Reading
RAID3_Temp | 1Ah | ns | 11.3 | No Reading
GPU0_Temp | 20h | ns | 11.0 | No Reading
GPU1_Temp | 21h | ns | 11.1 | No Reading
GPU2_Temp | 22h | ns | 11.2 | No Reading
GPU3_Temp | 23h | ns | 11.3 | No Reading
GPU4_Temp | 24h | ns | 11.4 | No Reading
GPU5_Temp | 25h | ns | 11.5 | No Reading
GPU6_Temp | 26h | ns | 11.6 | No Reading
GPU7_Temp | 27h | ns | 11.7 | No Reading
PCIE_SSD0_Temp | A7h | ns | 11.0 | No Reading
PCIE_SSD1_Temp | A8h | ns | 11.1 | No Reading
PCIE_SSD2_Temp | A9h | ns | 11.2 | No Reading
PCIE_SSD3_Temp | AAh | ns | 11.3 | No Reading
PCIE_SSD4_Temp | ABh | ns | 11.4 | No Reading
PCIE_SSD5_Temp | ACh | ns | 11.5 | No Reading
PCIE_SSD6_Temp | ADh | ns | 11.6 | No Reading
PCIE_SSD7_Temp | AEh | ns | 11.7 | No Reading
M.2_Inlet_Temp | 05h | ok | 55.0 | 28 degrees C
Rear_HDDBP_Temp | 2Ah | ns | 11.0 | No Reading
SWITCH0_Temp | 4Ah | ns | 11.0 | No Reading
SWITCH1_Temp | 4Bh | ns | 11.1 | No Reading
HDD_Max_Temp | 2Bh | ok | 11.0 | 32 degrees C
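Many of the temperature sensors report `ns` / `No Reading` because the corresponding hardware is not installed, so a check has to skip those before comparing values. A minimal sketch under assumptions: `read_temperatures` and `too_hot` are names I chose, and the limits in the usage note are arbitrary example values, not vendor thresholds.

```python
import re

# Three lines copied from the output above.
SAMPLE = """\
Inlet_Temp | 00h | ok | 12.0 | 22 degrees C
OCP_Temp | 29h | ns | 11.0 | No Reading
RAID0_Temp | 17h | ok | 11.0 | 58 degrees C
"""


def read_temperatures(elist_text):
    """Return {sensor_name: degrees_c} for sensors that have a reading."""
    temps = {}
    for line in elist_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        name, _sensor_id, status, _entity, reading = fields
        match = re.fullmatch(r"(-?\d+(?:\.\d+)?) degrees C", reading)
        if status != "ns" and match:
            temps[name] = float(match.group(1))
    return temps


def too_hot(elist_text, limits):
    """Return the sensors whose reading exceeds the per-sensor limit."""
    temps = read_temperatures(elist_text)
    return {
        name: value
        for name, value in temps.items()
        if name in limits and value > limits[name]
    }
```

For example, `too_hot(SAMPLE, {"Inlet_Temp": 35, "RAID0_Temp": 55})` reports only `RAID0_Temp`, since 58 is above the example limit of 55.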
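RAID array monitoring
For other MegaCli64 usage, a web search (for example on Baidu) will turn up the details.
- Physical disk information
sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog| egrep -iv "exit|Adapter"
Enclosure Device ID: 8 # enclosure ID
Slot Number: 13 # disk slot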
Enclosure position: 0
Device Id: 14
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors] # device size
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Firmware state: Online, Spun Up # disk state; disk monitoring watches this value
SAS Address(0): 0x56c92bf001fa0bcd
Connected Port Number: 0(path0)
Inquiry Data: V6J3J9SS HGST HUS726T4TALA6L4 VLGAW41G
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: Unknown
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :27C (80.60 F) # temperature
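The `-PDList` output is a flat stream of `key: value` lines, with each disk record starting at `Enclosure Device ID`. The following sketch groups the lines per disk and reports any disk whose `Firmware state` does not start with `Online`. The function names and the second sample record (slot 14, state `Failed`) are illustrative assumptions; hot spares and unconfigured disks report other states and would need to be added to `OK_PREFIXES` if they are expected on the host. The `#` annotations in the listing above are my notes, not tool output, so they are not present in the text being parsed.

```python
# States treated as healthy; an assumption, extend as needed.
OK_PREFIXES = ("Online",)

# First record is abridged from the output above; the second is invented.
SAMPLE = """\
Enclosure Device ID: 8
Slot Number: 13
Firmware state: Online, Spun Up
Drive Temperature :27C (80.60 F)

Enclosure Device ID: 8
Slot Number: 14
Firmware state: Failed
"""


def parse_pdlist(text):
    """Group `MegaCli64 -PDList` key/value lines into one dict per disk."""
    disks, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or non key/value line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            current = {}  # a new disk record starts here
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


def unhealthy_disks(text):
    """Return (enclosure, slot, state) for disks not in an OK state."""
    bad = []
    for disk in parse_pdlist(text):
        state = disk.get("Firmware state", "")
        if not state.startswith(OK_PREFIXES):
            bad.append((disk.get("Enclosure Device ID"), disk.get("Slot Number"), state))
    return bad
```

The enclosure and slot numbers in the result are what you need to locate the physical drive.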
- Getting virtual drive information
A controller may have many arrays; only one of them is shown here.
sudo /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aAll -NoLog| egrep -iv "exit|Adapter"
Virtual Drive: 9 (Target Id: 9)
Name :
RAID Level : Primary-0, Secondary-0, RAID Level Qualifier-0 # this is RAID 0
Size : 3.637 TB
State : Optimal # state of this array; array monitoring watches this value
Strip Size : 64 KB # strip size
Number Of Drives : 1
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Bad Blocks Exist: No
Number of Spans: 1
Span: 0 - Number of PDs: 1
# The disks in this array are listed below. Note that a disk appears in this list only while it is online or a hot spare.
PD: 0 Information
Enclosure Device ID: 8
Slot Number: 10
Enclosure position: 0
Device Id: 17
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x56c92bf001fa0bca
Connected Port Number: 0(path0)
Inquiry Data: V6J3J1BS HGST HUS726T4TALA6L4 VLGAW41G
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: Unknown
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :28C (82.40 F)
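For the array check, only the `State` line directly under each `Virtual Drive` header matters; the `Firmware state` and `Foreign State` lines in the per-disk sections must not be confused with it. A rough sketch of that extraction follows. The function names are hypothetical, and the second virtual drive with state `Degraded` is invented for illustration.

```python
import re

# Abridged from the output above; virtual drive 10 is invented.
SAMPLE = """\
Virtual Drive: 9 (Target Id: 9)
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
State               : Optimal
PD: 0 Information
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 10 (Target Id: 10)
State               : Degraded
"""


def virtual_drive_states(text):
    """Return {virtual_drive_number: state} from `MegaCli64 -LdPdInfo` output."""
    states, current = {}, None
    for line in text.splitlines():
        header = re.match(r"\s*Virtual Drive\s*:\s*(\d+)", line)
        if header:
            current = int(header.group(1))
            continue
        key, sep, value = line.partition(":")
        # Match the key exactly so "Firmware state" and "Foreign State"
        # in the physical disk sections are ignored.
        if sep and current is not None and key.strip() == "State":
            states.setdefault(current, value.strip())
    return states


def non_optimal(text):
    """Return the virtual drives whose state is anything but Optimal."""
    return {vd: s for vd, s in virtual_drive_states(text).items() if s != "Optimal"}
```

Any non-empty result from `non_optimal` would be the alert condition.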
- Viewing RAID card details (BBU status)
sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aAll
BBU status for Adapter: 0
BatteryType: CVPM02
Voltage: 9431 mV
Current: 0 mA
Temperature: 25 C
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
Battery state:
GasGuageStatus:
Fully Discharged : Yes
Fully Charged : Yes
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Remaining Capacity Alarm: Yes
Discharge Terminated : Yes
Over Temperature : No
Charging Terminated : Yes
Over Charged : No
Pack energy : 247 J
Capacitance : 110
Remaining reserve space : 0
BBU Design Info for Adapter: 0
Date of Manufacture: 08/06, 2019
Design Capacity: 288 J
Design Voltage: 9500 mV
Serial Number: 1550
Manufacture Name: LSI
Device Name: CVPM02
Device Chemistry: EDLC
Battery FRU: N/A
TMM FRU: N/A
Module Version: 6635-02A
BBU Properties for Adapter: 0
Auto Learn Period: 2412000 Sec
Next Learn time: 634778466 Sec
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
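The BBU output has many Yes/No flags, and only some of them clearly indicate a problem. As a sketch, the check below watches three flags from the `BBU Firmware Status` section. Which flags to alert on is my assumption rather than a vendor recommendation, and `ALARM_KEYS`, `bbu_alarms`, and the `Yes` value in the sample are illustrative.

```python
# Flags treated as alarms when they read "Yes"; an assumption, adjust as needed.
ALARM_KEYS = {
    "Battery Pack Missing",
    "Battery Replacement required",
    "Pack is about to fail & should be replaced",
}

# Field names are from the output above; the "Yes" value is invented.
SAMPLE = """\
Battery Pack Missing : No
Battery Replacement required : Yes
Remaining Capacity Low : No
Pack is about to fail & should be replaced : No
"""


def bbu_alarms(text):
    """Return the alarm flags that are set in `MegaCli64 -AdpBbuCmd` output."""
    alarms = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ALARM_KEYS and value.strip().lower() == "yes":
            alarms.append(key.strip())
    return alarms
```

The flags under `GasGuageStatus` are deliberately left out, since in the output above several of them read Yes on a unit whose firmware status is otherwise OK.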