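Monitoring server BMCs with Prometheus and ipmi_exporter (server hardware monitoring with Prometheus)
Monitoring hardware with Prometheus
Install ipmitool and load the required kernel modules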
yum install ipmitool freeipmi -y
modprobe ipmi_msghandler
modprobe ipmi_devintf
modprobe ipmi_poweroff
modprobe ipmi_si
modprobe ipmi_watchdog
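After running the modprobe commands it can help to confirm that the modules actually loaded. The following is a minimal sketch, assuming a Linux host where /proc/modules is readable; the function names `missing_modules` and `check_host` are hypothetical helpers, not part of ipmitool or ipmi_exporter. Modules compiled directly into the kernel do not appear in /proc/modules, so a "missing" result is only a hint.

```python
# Sketch: report which of the IPMI modules loaded above are absent from /proc/modules.
# The module list mirrors the modprobe commands in this post.
REQUIRED = (
    "ipmi_msghandler",
    "ipmi_devintf",
    "ipmi_poweroff",
    "ipmi_si",
    "ipmi_watchdog",
)


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required module names not listed in the given /proc/modules text."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


def check_host(path="/proc/modules"):
    """Read the module list from a Linux host and return the missing modules."""
    with open(path) as fh:
        return missing_modules(fh.read())
```

On the monitored host, `check_host()` returning an empty list suggests every module from the commands above is loaded.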
Download the ipmi_exporter release package
wget https://github.com/soundcloud/ipmi_exporter/releases/download/v1.0.0/ipmi_exporter-v1.0.0.linux-amd64.tar.gz
tar -xf ipmi_exporter-v1.0.0.linux-amd64.tar.gz -C /opt/
cd /opt/ipmi_exporter-v1.0.0.linux-amd64/
Add the configuration file
cat ipmi_remote.yml
modules:
  10.193.x.x: # IP address of the remote management card (BMC)
    user: "root" # BMC user
    pass: "xxxxxxxxxxxxx" # BMC password
    # Available collectors are bmc, ipmi, chassis, and dcmi
    collectors:
    - bmc
    - ipmi
    - dcmi
    - chassis
    # Got any sensors you don't care about? Add them here.
    exclude_sensor_ids:
    - 2
    - 29
    - 32
Start ipmi_exporter
./ipmi_exporter --config.file=/opt/ipmi_exporter-v1.0.0.linux-amd64/ipmi_remote.yml --web.listen-address=:19293 &
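Once the exporter is running, a quick check is to fetch its metrics page and see which `ipmi_` metrics it exposes. The sketch below assumes the exporter listens on localhost:19293 as started above and serves the standard Prometheus text format at /metrics; `ipmi_metric_names` and `fetch_metrics` are hypothetical helper names.

```python
# Sketch: list the ipmi_* metric names exposed by a running exporter.
from urllib.request import urlopen


def ipmi_metric_names(exposition_text):
    """Return the sorted metric names starting with 'ipmi_' in Prometheus text format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("ipmi_"):
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://localhost:19293/metrics", timeout=10):
    """Download the exporter's metrics page (the URL is an assumption based on the flags above)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `ipmi_metric_names(fetch_metrics())` on the exporter host should return a non-empty list if the collectors are working; an empty list would suggest the IPMI tools or kernel modules are not available to the exporter.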
Add the job configuration on the Prometheus server
# Add the rule files for the ipmi exporter (entries under rule_files)
  - "rules/Memory_hardware.yml"
  - "rules/power.yml"
  - "rules/fan.yml"
  - "rules/processor.yml"
  - "rules/harddisk.yml"
# Add the job to the main configuration file (under scrape_configs)
# cat /usr/local/prometheus/prometheus.yml
  - job_name: 'ipmi_exporter'
    file_sd_configs:
      - refresh_interval: 5s
        files:
          - ./conf.d/ipmi_exporter.json
#cat /usr/local/prometheus/conf.d/ipmi_exporter.json
[
  {
    "targets": ["10.65.x.x:19293"],
    "labels": {
      "hostname": "lgy-storage-glusterxxx"
    }
  }
]
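With more than a handful of servers, the file_sd target file is easier to generate than to edit by hand. The following is a minimal sketch, assuming one exporter per host on port 19293 and a single `hostname` label as in the JSON above; the helper names and the example addresses in the usage note are hypothetical. The file is written to a temporary name and then renamed, so Prometheus should never read a half-written file.

```python
# Sketch: generate a file_sd JSON target list in the same shape as ipmi_exporter.json.
import json
import os
import tempfile


def build_file_sd(hosts, port=19293):
    """hosts maps exporter address -> hostname label; returns the file_sd structure."""
    return [
        {"targets": [f"{addr}:{port}"], "labels": {"hostname": name}}
        for addr, name in sorted(hosts.items())
    ]


def write_file_sd(path, hosts, port=19293):
    """Write the target list to path via a temporary file and an atomic rename."""
    data = json.dumps(build_file_sd(hosts, port), indent=2)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        fh.write(data + "\n")
    # mkstemp creates the file as 0600; relax it so the Prometheus user can read it.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
```

For example, `write_file_sd("/usr/local/prometheus/conf.d/ipmi_exporter.json", {"192.0.2.10": "example-host"})` would produce one target entry; with `refresh_interval: 5s`, Prometheus re-reads the file shortly after the rename.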
Add the rules configuration files
# cd /usr/local/prometheus/rules
# cat Memory_hardware.yml (memory module monitoring)
groups:
- name: Memory_hardware
  rules:
  - alert: Memory_hardware error
    expr: ipmi_sensor_state{type="Memory"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} memory hardware warning"
      description: "{{ $labels.instance }} of job {{$labels.job}} memory hardware warning, current state [{{ $value }}]."
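To see what a rule such as `ipmi_sensor_state{type="Memory"} == 1` would select before loading it into Prometheus, the same filter can be applied to a saved copy of the exporter's metrics output. This is a rough sketch, not a PromQL implementation: `matching_sensors` is a hypothetical helper that handles only exact label matches and a single numeric comparison. As far as I understand the exporter's metric documentation, the state values are 0 for nominal, 1 for warning and 2 for critical, so a comparison with `== 1` would not match a sensor that reports 2; treat that reading as an assumption and check it against your exporter version.

```python
# Sketch: emulate `metric{label="value"} == state` on Prometheus text-format output.
import re

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# label="value" pairs; allows backslash escapes inside the value
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def matching_sensors(text, metric, wanted_labels, state=1.0):
    """Return the label sets of samples whose labels match and whose value equals state."""
    hits = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        if not all(labels.get(k) == v for k, v in wanted_labels.items()):
            continue
        # NaN never compares equal, so unavailable sensors are skipped here.
        if float(m.group("value")) == state:
            hits.append(labels)
    return hits
```

Passing the text of the metrics page together with `"ipmi_sensor_state"` and `{"type": "Memory"}` returns the label sets of the memory sensors currently in state 1, which are the series the alert above would evaluate as firing candidates.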
# cat power.yml (server power supply monitoring)
groups:
- name: power status
  rules:
  - alert: power bad
    expr: ipmi_sensor_state{name="Status",type="Power Supply"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} power supply failure"
      description: "{{ $labels.instance }} of job {{$labels.job}} power supply failure, current state [{{ $value }}]."
# cat fan.yml (server fan monitoring)
groups:
- name: fan status
  rules:
  - alert: speed fan bad
    expr: ipmi_fan_speed_state{} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} fan failure"
      description: "{{ $labels.instance }} of job {{$labels.job}} fan failure, current state [{{ $value }}]."
# cat processor.yml (server processor monitoring)
groups:
- name: Processor
  rules:
  - alert: Processor hardware error
    expr: ipmi_sensor_state{name="Status",type="Processor"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} processor hardware warning"
# cat harddisk.yml (disk monitoring, mainly of RAID groups; the system disks and data disks are in separate RAID groups, so two sensors are reported)
groups:
- name: harddisk
  rules:
  - alert: hard disk bad
    expr: ipmi_sensor_state{type="Drive Slot"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} hard disk failure"
      description: "{{ $labels.instance }} of job {{$labels.job}} hard disk failure, current state [{{ $value }}]."
Reposted from: https://www.cnblogs.com/lixinliang/p/15019679.html