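Monitoring HP server RAID disk health with Zabbix

This addresses the problem of RAID disks failing unnoticed on HP servers. It amounts to inspecting the disk status once a day, so that disk problems are found and handled promptly.

The system environment is CentOS 7.9.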

 

Download location: https://downloads.linux.hpe.com/SDR/repo/mcp/centos/7/x86_64/current/

wget https://downloads.linux.hpe.com/SDR/repo/mcp/centos/7/x86_64/current/ssacli-5.10-44.0.x86_64.rpm

Install HP's official tool (ssacli):

rpm -ivh ssacli-5.10-44.0.x86_64.rpm

Alternatively, install it directly from the URL with the following command:

rpm -ivh https://downloads.linux.hpe.com/SDR/repo/mcp/centos/7/x86_64/current/ssacli-5.10-44.0.x86_64.rpm

Command to view the disk status, with sample output:

   ssacli ctrl slot=0 pd all show status

   physicaldrive 1I:3:1 (port 1I:box 3:bay 1, 300 GB): OK
   physicaldrive 1I:3:2 (port 1I:box 3:bay 2, 300 GB): OK
   physicaldrive 1I:3:3 (port 1I:box 3:bay 3, 300 GB): OK
   physicaldrive 1I:3:4 (port 1I:box 3:bay 4, 300 GB): OK
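
The counts used in the later steps come from matching status strings in this output. A minimal sketch of that counting step, assuming the one-line-per-drive format shown above; `count_status` is a hypothetical helper name, not part of ssacli or Zabbix:

```shell
#!/bin/bash
# Hypothetical helper: count the drives whose status line contains a given
# fixed string. Reads `ssacli ... pd all show status` style text from stdin.
count_status() {
    # grep -c prints the number of matching lines. It exits with status 1
    # when that number is 0, so `|| true` keeps the exit status clean.
    grep -cF -- "$1" || true
}
```

It could be used as `ssacli ctrl slot=0 pd all show status | count_status "Failed"`, which prints 0 when every drive reports OK.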

Configure zabbix_agentd.conf

echo "UserParameter=diskbad.count[*],/usr/local/sbin/diskinfo_z.sh" >> /etc/zabbix/zabbix_agentd.conf 

Add scheduled tasks (crontab entries) that run the check once a day

# Drives that have already failed
01 01 * * * /usr/sbin/ssacli ctrl slot=0 pd all show status | grep Failed|wc -l > /usr/local/sbin/diskinfo.txt
# Drives that are about to fail (predictive failure)
01 01 * * * /usr/sbin/ssacli ctrl slot=0 pd all show status|grep "Predictive Failure" |wc -l > /usr/local/sbin/diskinfo_predictive.txt
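
The two cron entries each query the controller separately. As an alternative, the following sketch reads the status text once and writes both count files. It assumes the same file names as above; the function name `write_disk_counts` and the temporary-file-then-rename step are illustrative additions, not part of the original setup:

```shell
#!/bin/bash
# Hypothetical collector: read drive status text from stdin and write the
# two count files into the directory given as the first argument.
write_disk_counts() {
    local out_dir="$1" status failed predictive
    status="$(cat)"
    # grep -c exits 1 when nothing matches but still prints 0.
    failed="$(printf '%s\n' "$status" | grep -c 'Failed' || true)"
    predictive="$(printf '%s\n' "$status" | grep -c 'Predictive Failure' || true)"
    # Write to a temporary file and rename it, so that a reader never sees
    # a half-written count file.
    printf '%s\n' "$failed" > "$out_dir/diskinfo.txt.tmp" &&
        mv "$out_dir/diskinfo.txt.tmp" "$out_dir/diskinfo.txt"
    printf '%s\n' "$predictive" > "$out_dir/diskinfo_predictive.txt.tmp" &&
        mv "$out_dir/diskinfo_predictive.txt.tmp" "$out_dir/diskinfo_predictive.txt"
}
```

Inside a script run from cron, it could be called as `/usr/sbin/ssacli ctrl slot=0 pd all show status | write_disk_counts /usr/local/sbin`.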

Add the check script

cat > /usr/local/sbin/diskinfo_z.sh <<EOF
#!/bin/bash
cat /usr/local/sbin/diskinfo.txt
EOF
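
The script above reports only the failed-drive count and prints nothing if the file is missing. A possible variant, sketched here under the assumption that both count files live in the same directory, sums the failed and predictive-failure counts and falls back to a default when a file is absent or malformed. The names `read_count` and `disk_problem_count` are hypothetical:

```shell
#!/bin/bash
# Hypothetical variant of diskinfo_z.sh.
# read_count FILE FALLBACK: print the integer on the first line of FILE, or
# FALLBACK when the file is unreadable or is not a plain non-negative integer.
read_count() {
    local file="$1" fallback="$2" value=""
    if [ -r "$file" ]; then
        # `|| true` tolerates a last line without a trailing newline.
        read -r value < "$file" || true
    fi
    case "$value" in
        ''|*[!0-9]*) printf '%s\n' "$fallback" ;;
        *)           printf '%s\n' "$value" ;;
    esac
}

# disk_problem_count DIR: print failed + predictive-failure drive counts.
disk_problem_count() {
    local dir="$1" failed predictive
    failed="$(read_count "$dir/diskinfo.txt" 0)"
    predictive="$(read_count "$dir/diskinfo_predictive.txt" 0)"
    # 10# forces base 10 so a value such as 08 is not parsed as octal.
    echo $(( 10#$failed + 10#$predictive ))
}
```

Whether a missing file should read as 0 or as a non-zero value that fires the trigger is a design choice; the sketch uses 0 and leaves the fallback as an argument so it can be changed.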

Make the script executable

chmod +x /usr/local/sbin/diskinfo_z.sh

Configure the Zabbix item

Configure the trigger

Restart the agent service

systemctl restart zabbix-agent

Test the result by writing a non-zero count into the data file

echo 1 > /usr/local/sbin/diskinfo.txt

 

The DingTalk group received the alert, so the test succeeded.

 

Attached monitoring template: zbx_export_templates.xml

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
    <version>5.0</version>
    <date>2023-03-23T08:00:53Z</date>
    <groups>
        <group>
            <name>Linux servers</name>
        </group>
    </groups>
    <templates>
        <template>
            <template>HPdiskbakcount</template>
            <name>HP server disk health monitoring</name>
            <description>HP server disk health monitoring</description>
            <groups>
                <group>
                    <name>Linux servers</name>
                </group>
            </groups>
            <items>
                <item>
                    <name>diskbadcount</name>
                    <key>diskbad.count</key>
                    <description>HP server disk health monitoring; the data is updated once a day</description>
                    <valuemap>
                        <name>diskbadcount</name>
                    </valuemap>
                    <triggers>
                        <trigger>
                            <expression>{last(,30)}&lt;&gt;0</expression>
                            <name>Physical disk status abnormal, please check and handle it as soon as possible</name>
                            <priority>HIGH</priority>
                            <description>Physical disk status abnormal</description>
                            <manual_close>YES</manual_close>
                        </trigger>
                    </triggers>
                </item>
            </items>
        </template>
    </templates>
    <graphs>
        <graph>
            <name>HP server disk health</name>
            <graph_items>
                <graph_item>
                    <sortorder>1</sortorder>
                    <color>1A7C11</color>
                    <item>
                        <host>HPdiskbakcount</host>
                        <key>diskbad.count</key>
                    </item>
                </graph_item>
            </graph_items>
        </graph>
    </graphs>
    <value_maps>
        <value_map>
            <name>diskbadcount</name>
            <mappings>
                <mapping>
                    <value>0</value>
                    <newvalue>OK</newvalue>
                </mapping>
                <mapping>
                    <value>1</value>
                    <newvalue>Failed1</newvalue>
                </mapping>
                <mapping>
                    <value>2</value>
                    <newvalue>Failed2</newvalue>
                </mapping>
            </mappings>
        </value_map>
    </value_maps>
</zabbix_export>

 

posted @ 2023-02-02 14:10  周智林