Hardware monitoring on Inspur servers (ipmitool, MegaCli)


Why monitor with ipmitool and MegaCli?

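On Inspur servers, the information reported by ipmitool corresponds one-to-one with what the management card shows. Take, for example, the fan status values shown on the management card.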

The same fan information, obtained with ipmitool from inside the operating system:

ipmitool sdr list | grep 'FAN[0-6]'
FAN0_F_Speed     | 4224 RPM          | ok
FAN0_R_Speed     | 3744 RPM          | ok
FAN1_F_Speed     | 4224 RPM          | ok
FAN1_R_Speed     | 3744 RPM          | ok
FAN2_F_Speed     | 4320 RPM          | ok
FAN2_R_Speed     | 3840 RPM          | ok
FAN3_F_Speed     | 4224 RPM          | ok
FAN3_R_Speed     | 3840 RPM          | ok

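The output shows each fan's reading and its status.
Fan monitoring differs between Dell and Inspur. On Dell, an abnormal fan reading is reflected in the returned status. On Inspur, only the presence of the fan is monitored. Suppose a fan's reading drops to 200 while the fan is still present: Dell reports an abnormal status and raises an alert, Inspur does not. (This rarely happens in practice; I have never seen it myself.)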
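
Since the status column alone does not catch a low reading, one option is to compare the RPM value against a threshold yourself. The following is a minimal sketch, not the author's script: the function names and the MIN_RPM threshold are assumptions chosen for illustration, and the sample text reuses the hypothetical reading of 200 from the paragraph above.

```python
import re

# Hypothetical threshold; choose a value that suits your hardware.
MIN_RPM = 1000


def parse_fan_speeds(sdr_output):
    """Parse `ipmitool sdr list` lines such as 'FAN0_F_Speed | 4224 RPM | ok'."""
    fans = {}
    for line in sdr_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3:
            continue
        name, reading, status = fields
        match = re.fullmatch(r"(\d+(?:\.\d+)?) RPM", reading)
        if name.startswith("FAN") and match:
            fans[name] = (float(match.group(1)), status)
    return fans


def slow_fans(fans, min_rpm=MIN_RPM):
    """Return the names of fans reading below min_rpm, whatever their status."""
    return sorted(name for name, (rpm, _status) in fans.items() if rpm < min_rpm)


# Illustrative input: the second line is a made-up low reading.
sample = """\
FAN0_F_Speed     | 4224 RPM          | ok
FAN0_R_Speed     | 200 RPM           | ok
"""
print(slow_fans(parse_fan_speeds(sample)))
```

On a real host the input could come from `subprocess.run(["ipmitool", "sdr", "list"], capture_output=True, text=True).stdout` instead of the sample string.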

Monitoring with ipmitool

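  • Processor status monitoring
    The state to check here is not the ok column but the Presence detected column.
    ipmitool sdr elist | grep -i 'cpu[0-2]_status'
    The output matches what the management card shows: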
CPU0_Status      | 7Dh | ok  |  3.0 | Presence detected
CPU1_Status      | 7Eh | ok  |  3.0 | Presence detected
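  • Memory status
    Note: Inspur is rather annoying here. The memory sensors have names that begin with CPU, so check the actual sensor names before writing the grep. Sensor names are also not consistent from one server model to another, so the monitoring has to handle the differences.
    As with the processors, check whether the state value is Presence Detected.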

ipmitool sdr elist | grep -i 'CPU[0-1]_C[0-1]D[0-1]'

CPU0_C0D0        | 83h | ok  | 32.0 | Presence Detected
CPU0_C0D1        | 84h | ok  | 32.1 |
CPU0_C1D0        | 85h | ok  | 32.2 | Presence Detected
CPU0_C1D1        | 86h | ok  | 32.3 |
CPU1_C0D0        | 8Fh | ok  | 32.12 | Presence Detected
CPU1_C0D1        | 90h | ok  | 32.13 |
CPU1_C1D0        | 91h | ok  | 32.14 | Presence Detected
CPU1_C1D1        | 92h | ok  | 32.15 |
  • Disk slot status
    ipmitool sdr elist | grep -i disk
The disk slots:
DISK0_Status     | B4h | ok  |  4.0 | Drive Present
DISK1_Status     | B5h | ok  |  4.1 | Drive Present
DISK2_Status     | B6h | ok  |  4.2 | Drive Present
DISK3_Status     | B7h | ok  |  4.3 | Drive Present
DISK4_Status     | B8h | ok  |  4.4 | Drive Present
DISK5_Status     | B9h | ok  |  4.5 | Drive Present
DISK6_Status     | BAh | ok  |  4.6 | Drive Present
DISK7_Status     | BBh | ok  |  4.7 | Drive Present
DISK8_Status     | BCh | ok  |  4.8 | Drive Present
DISK9_Status     | BDh | ok  |  4.9 | Drive Present
DISK10_Status    | BEh | ok  |  4.10 | Drive Present
DISK11_Status    | BFh | ok  |  4.11 | Drive Present
DISK12_Status    | C0h | ok  |  4.12 |
DISK13_Status    | C1h | ok  |  4.13 |
DISK14_Status    | C2h | ok  |  4.14 |
DISK15_Status    | C3h | ok  |  4.15 |
DISK16_Status    | C4h | ok  |  4.16 |
DISK17_Status    | C5h | ok  |  4.17 |
DISK18_Status    | C6h | ok  |  4.18 |
DISK19_Status    | C7h | ok  |  4.19 |
DISK20_Status    | C8h | ok  |  4.20 |
DISK21_Status    | C9h | ok  |  4.21 |
DISK22_Status    | CAh | ok  |  4.22 |
DISK23_Status    | CBh | ok  |  4.23 |
DISK24_Status    | D4h | ok  |  4.24 |
The disk backplane slots:
DISK0_R_Status   | CCh | ok  |  4.0 | Drive Present
DISK1_R_Status   | CDh | ok  |  4.1 | Drive Present
DISK2_R_Status   | CEh | ok  |  4.2 |
DISK3_R_Status   | CFh | ok  |  4.3 |
DISK4_R_Status   | D0h | ok  |  4.4 |
DISK5_R_Status   | D1h | ok  |  4.5 |
DISK6_R_Status   | D2h | ok  |  4.6 |
DISK7_R_Status   | D3h | ok  |  4.7 |
  • Power supply information
    ipmitool sdr elist | grep -i 'psu[0-1]_status'
PSU0_Status      | 74h | ok  | 10.0 | Presence detected
PSU1_Status      | 75h | ok  | 10.0 | Presence detected
  • Fan status information
    ipmitool sdr elist | grep -i 'fan[0-9]_Present'
FAN0_Present     | 60h | ok  | 29.0 | Device Present
FAN1_Present     | 61h | ok  | 29.1 | Device Present
FAN2_Present     | 62h | ok  | 29.2 | Device Present
FAN3_Present     | 63h | ok  | 29.3 | Device Present
  • Temperature monitoring
    ipmitool sdr elist | grep -i temp
Inlet_Temp       | 00h | ok  | 12.0 | 22 degrees C
Outlet_Temp      | 01h | ok  | 55.1 | 32 degrees C
CPU0_Temp        | 06h | ok  |  3.0 | 28 degrees C
CPU1_Temp        | 07h | ok  |  3.0 | 26 degrees C
CPU0_DIMM_Temp   | 0Eh | ok  | 32.0 | 34 degrees C
CPU1_DIMM_Temp   | 0Fh | ok  | 32.0 | 32 degrees C
CPU0_VR_Temp     | 02h | ok  |  3.0 | 31 degrees C
CPU1_VR_Temp     | 03h | ok  |  3.1 | 30 degrees C
PCH_Temp         | 16h | ok  |  3.0 | 44 degrees C
OCP_Temp         | 29h | ns  | 11.0 | No Reading
NVME_Temp        | 28h | ns  | 11.1 | No Reading
PSU0_Temp        | 1Ch | ok  | 32.0 | 28 degrees C
PSU1_Temp        | 1Dh | ok  | 32.0 | 27 degrees C
RAID0_Temp       | 17h | ok  | 11.0 | 58 degrees C
RAID1_Temp       | 18h | ns  | 11.1 | No Reading
RAID2_Temp       | 19h | ns  | 11.2 | No Reading
RAID3_Temp       | 1Ah | ns  | 11.3 | No Reading
GPU0_Temp        | 20h | ns  | 11.0 | No Reading
GPU1_Temp        | 21h | ns  | 11.1 | No Reading
GPU2_Temp        | 22h | ns  | 11.2 | No Reading
GPU3_Temp        | 23h | ns  | 11.3 | No Reading
GPU4_Temp        | 24h | ns  | 11.4 | No Reading
GPU5_Temp        | 25h | ns  | 11.5 | No Reading
GPU6_Temp        | 26h | ns  | 11.6 | No Reading
GPU7_Temp        | 27h | ns  | 11.7 | No Reading
PCIE_SSD0_Temp   | A7h | ns  | 11.0 | No Reading
PCIE_SSD1_Temp   | A8h | ns  | 11.1 | No Reading
PCIE_SSD2_Temp   | A9h | ns  | 11.2 | No Reading
PCIE_SSD3_Temp   | AAh | ns  | 11.3 | No Reading
PCIE_SSD4_Temp   | ABh | ns  | 11.4 | No Reading
PCIE_SSD5_Temp   | ACh | ns  | 11.5 | No Reading
PCIE_SSD6_Temp   | ADh | ns  | 11.6 | No Reading
PCIE_SSD7_Temp   | AEh | ns  | 11.7 | No Reading
M.2_Inlet_Temp   | 05h | ok  | 55.0 | 28 degrees C
Rear_HDDBP_Temp  | 2Ah | ns  | 11.0 | No Reading
SWITCH0_Temp     | 4Ah | ns  | 11.0 | No Reading
SWITCH1_Temp     | 4Bh | ns  | 11.1 | No Reading
HDD_Max_Temp     | 2Bh | ok  | 11.0 | 32 degrees C
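
The checks above all read the last column of `ipmitool sdr elist`, so they can be sketched with one small parser. This is only an illustration under stated assumptions: the function names, the list of expected sensors, and the temperature limit are hypothetical, and because sensor names differ between models the expected list has to be built per model.

```python
def parse_elist(output):
    """Parse `ipmitool sdr elist` lines into (name, status, state) tuples."""
    rows = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 5:
            name, _sensor_id, status, _entity, state = fields
            rows.append((name, status, state))
    return rows


def missing_components(rows, expected):
    """Return the expected sensor names whose state does not report presence."""
    states = {name: state.lower() for name, _status, state in rows}
    # "presen" covers Presence detected, Drive Present and Device Present.
    return [name for name in expected if "presen" not in states.get(name, "")]


def hot_sensors(rows, limit_c):
    """Return (name, value) for temperature sensors reading above limit_c."""
    hot = []
    for name, status, state in rows:
        if status == "ok" and state.endswith("degrees C"):
            value = float(state.split()[0])
            if value > limit_c:
                hot.append((name, value))
    return hot


# A few lines copied from the outputs above.
sample = """\
CPU0_Status      | 7Dh | ok  |  3.0 | Presence detected
CPU0_C0D0        | 83h | ok  | 32.0 | Presence Detected
CPU0_C0D1        | 84h | ok  | 32.1 |
DISK0_Status     | B4h | ok  |  4.0 | Drive Present
OCP_Temp         | 29h | ns  | 11.0 | No Reading
RAID0_Temp       | 17h | ok  | 11.0 | 58 degrees C
"""
rows = parse_elist(sample)
# The expected list and the 50 degree limit are made up for this example.
print(missing_components(rows, ["CPU0_Status", "CPU0_C0D0", "CPU0_C0D1"]))
print(hot_sensors(rows, 50))
```

Sensors with status ns (No Reading) are skipped by the temperature check rather than treated as failures, since the outputs above show many unpopulated sensors in that state.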

RAID array monitoring

For other MegaCli64 usage, a web search will turn up plenty of material.

  • Physical disk information
    sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | grep -Eiv "exit|Adapter"
Enclosure Device ID: 8 # enclosure ID
Slot Number: 13 # disk slot
Enclosure position: 0
Device Id: 14
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]  # device size
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Firmware state: Online, Spun Up # disk state; this is the value that disk monitoring checks
SAS Address(0): 0x56c92bf001fa0bcd
Connected Port Number: 0(path0)
Inquiry Data: V6J3J9SS            HGST HUS726T4TALA6L4                    VLGAW41G
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: Unknown
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :27C (80.60 F) # temperature
  • Virtual drive information
    There may be many arrays; only one of them is shown here.
    sudo /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aAll -NoLog | grep -Eiv "exit|Adapter"
Virtual Drive: 9 (Target Id: 9)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0  # this is RAID 0
Size                : 3.637 TB
State               : Optimal # state of this array; this is the value that array monitoring checks
Strip Size          : 64 KB  # strip size
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Number of Spans: 1
Span: 0 - Number of PDs: 1

# Below is the information for the disks in this array. Note that a disk appears in this list only while it is online or a hot spare.
PD: 0 Information
Enclosure Device ID: 8
Slot Number: 10
Enclosure position: 0
Device Id: 17
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x56c92bf001fa0bca
Connected Port Number: 0(path0)
Inquiry Data: V6J3J1BS            HGST HUS726T4TALA6L4                    VLGAW41G
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: Unknown
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :28C (82.40 F)
  • RAID card details (BBU status)
    sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aAll
BBU status for Adapter: 0

BatteryType: CVPM02
Voltage: 9431 mV
Current: 0 mA
Temperature: 25 C

BBU Firmware Status:

 Charging Status              : None
 Voltage                                 : OK
 Temperature                             : OK
 Learn Cycle Requested                   : No
 Learn Cycle Active                      : No
 Learn Cycle Status                      : OK
 Learn Cycle Timeout                     : No
 I2c Errors Detected                     : No
 Battery Pack Missing                    : No
 Battery Replacement required            : No
 Remaining Capacity Low                  : No
 Periodic Learn Required                 : No
 Transparent Learn                       : No
 No space to cache offload               : No
 Pack is about to fail & should be replaced : No
 Cache Offload premium feature required  : No
 Module microcode update required        : No

Battery state:

GasGuageStatus:
 Fully Discharged        : Yes
 Fully Charged           : Yes
 Discharging             : Yes
 Initialized             : Yes
 Remaining Time Alarm    : No
 Remaining Capacity Alarm: Yes
 Discharge Terminated    : Yes
 Over Temperature        : No
 Charging Terminated     : Yes
 Over Charged            : No

 Pack energy             : 247 J
 Capacitance             : 110
 Remaining reserve space : 0


BBU Design Info for Adapter: 0

Date of Manufacture: 08/06, 2019
Design Capacity: 288 J
Design Voltage: 9500 mV
Serial Number: 1550
Manufacture Name: LSI
Device Name: CVPM02
Device Chemistry: EDLC
Battery FRU: N/A
TMM FRU: N/A
Module Version: 6635-02A


BBU Properties for Adapter: 0

Auto Learn Period: 2412000 Sec
Next Learn time: 634778466 Sec
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
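
The two values singled out above, Firmware state for each physical disk and State for each virtual drive, can be pulled out of the MegaCli64 text output with a small parser. This is a rough sketch under stated assumptions, not the author's monitoring script: the function names are hypothetical, treating states that begin with Online or Hotspare as healthy is an assumption, and the second disk in the sample is invented to show a failure.

```python
import re

# Assumption: states starting with these words are treated as healthy.
OK_PREFIXES = ("Online", "Hotspare")


def pd_states(output):
    """Map (enclosure, slot) to the 'Firmware state' of each physical disk."""
    states = {}
    enclosure = slot = None
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            enclosure = value
        elif key == "Slot Number":
            slot = value
        elif key == "Firmware state":
            states[(enclosure, slot)] = value
    return states


def vd_states(output):
    """Map each virtual drive number to its 'State' line."""
    states = {}
    current = None
    for line in output.splitlines():
        drive = re.match(r"Virtual Drive:\s*(\d+)", line)
        if drive:
            current = int(drive.group(1))
            continue
        state = re.match(r"State\s*:\s*(.+)", line)
        if state and current is not None:
            states[current] = state.group(1).strip()
    return states


def unhealthy_disks(states):
    """Return (enclosure, slot) pairs whose state is not in OK_PREFIXES."""
    return sorted(k for k, v in states.items() if not v.startswith(OK_PREFIXES))


# Shortened sample; the disk in slot 13 is made up to show a failure.
sample = """\
Virtual Drive: 9 (Target Id: 9)
State               : Optimal
PD: 0 Information
Enclosure Device ID: 8
Slot Number: 10
Firmware state: Online, Spun Up
Foreign State: None
Enclosure Device ID: 8
Slot Number: 13
Firmware state: Failed
"""
print(vd_states(sample))
print(unhealthy_disks(pd_states(sample)))
```

Because the virtual drive regex is anchored at the start of the line, the per-disk lines Firmware state and Foreign State are not mistaken for the array state.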
posted @ 2020-02-12 22:35  I'm杨呵呵