Building a monitoring platform with Prometheus
Preface
Packages to prepare:
- Prometheus: prometheus-2.22.2.linux-amd64.tar.gz
- Grafana: grafana-enterprise-7.3.4-1.x86_64.rpm
- node_exporter: node_exporter-1.0.1.linux-amd64.tar.gz
- mysqld_exporter: mysqld_exporter-0.12.1.linux-amd64.tar.gz
- Alertmanager: alertmanager-0.21.0.linux-amd64.tar.gz
- DingTalk alert plugin: prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
Prometheus hot-reloads its configuration when it receives SIGHUP:
kill -HUP <prometheus's pid>
This document covers only a basic initial setup. It does not cover high availability, microservices, container clouds, or log monitoring.
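As a small illustration of the hot reload mentioned in the preface, the sketch below wraps the signal in a helper function. The function name reload_by_pid and the use of pgrep to find the PID are assumptions made for this example; they are not part of the original setup.

```shell
# Send SIGHUP to a running process so that it re-reads its configuration.
# Usage: reload_by_pid <pid>
reload_by_pid() {
    if [ -z "$1" ]; then
        echo "usage: reload_by_pid <pid>" >&2
        return 1
    fi
    kill -HUP "$1"
}

# Example on the server node (assumes a single process named "prometheus"):
#   reload_by_pid "$(pgrep -x prometheus)"
```

Checking for an empty PID first avoids calling kill with no target when the lookup finds nothing.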
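Configuring the single server node
- Server node: 172.50.13.101
- Deployment location: /usr/local/prometheus/ (deploying under /usr/local/prometheus/prometheus/ is recommended)
- Configuration file: /usr/local/prometheus/prometheus.yml (placing it under /usr/local/prometheus/prometheus/ is recommended)
- Listening port: 19090
- Data storage location: /home/prometheus/data/ (storing data under /home/data/prometheus/ is recommended)
- Data retention: 15 days
Download the Prometheus tarball and extract it; the binary can be run directly.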
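The extraction step can be sketched as follows. The helper name unpack_release and the choice to strip the archive's top-level folder are assumptions for illustration; the file name and paths in the example comment are the ones listed in this document.

```shell
# Extract a release tarball into a target directory, dropping the
# top-level "<name>-<version>.linux-amd64/" folder inside the archive.
# Usage: unpack_release <tarball> <target-dir>
unpack_release() {
    mkdir -p "$2" && tar -xzf "$1" --strip-components=1 -C "$2"
}

# Example with the file name and locations used in this document:
#   unpack_release prometheus-2.22.2.linux-amd64.tar.gz /usr/local/prometheus
#   mkdir -p /home/prometheus/data
```

With this layout the binary ends up at /usr/local/prometheus/prometheus and the sample configuration at /usr/local/prometheus/prometheus.yml.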
prometheus.service
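Register Prometheus as a systemd service and start it at boot.
vim /usr/lib/systemd/system/prometheus.service
- File contents: see Appendix code -> prometheus.service at the end of this document
- Reload the service definitions:
systemctl daemon-reload
- Start:
systemctl start prometheus
- Enable at boot:
systemctl enable prometheus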
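After starting the service, it can help to confirm that Prometheus is answering on its port. The sketch below polls the /-/healthy endpoint that Prometheus exposes. The helper name wait_healthy, the retry count, and the use of curl are assumptions for this example.

```shell
# Poll the Prometheus health endpoint until it answers or the attempts run out.
# Usage: wait_healthy <host:port> [attempts]
wait_healthy() {
    target="$1"
    attempts="${2:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -fsS "http://$target/-/healthy" >/dev/null 2>&1; then
            echo "healthy"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not healthy after $attempts attempts" >&2
    return 1
}

# Example on the server node:
#   systemctl is-active prometheus && wait_healthy localhost:19090
```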
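Configuring monitored nodes
- Node: any server that needs to be monitored
- Deployment location: /usr/local/prometheus/node_exporter/
- Listening port: 18080
- Download the node_exporter tarball and extract it into the deployment location on the target server
vim /usr/lib/systemd/system/node_exporter.service
- File contents: see Appendix code -> node_exporter.service at the end of this document
- Reload the service definitions:
systemctl daemon-reload
- Start:
systemctl start node_exporter
- Enable at boot:
systemctl enable node_exporter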
Configuring MySQL monitoring
Adding an exporter account in the database
- Log in to the database that needs to be monitored
- Create the user exporter:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'XXXXXXXX' WITH MAX_USER_CONNECTIONS 3;
- Grant privileges:
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
- Flush privileges:
flush privileges;
Configuring mysqld_exporter
Create the file /usr/local/prometheus/mysqld_exporter/.my.cnf with the following contents:
[client]
user=exporter
password=XXXXXXXX
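Because this file holds a database password, one reasonable precaution is to create it readable only by its owner. The sketch below shows one way to do that. The helper name write_my_cnf is hypothetical, and the restrictive permissions are an addition that the original steps do not mention.

```shell
# Write a MySQL client credentials file readable only by its owner.
# Usage: write_my_cnf <path> <user> <password>
write_my_cnf() {
    (
        umask 077   # a newly created file gets mode 600
        printf '[client]\nuser=%s\npassword=%s\n' "$2" "$3" > "$1"
    ) && chmod 600 "$1"   # also tighten the mode if the file already existed
}

# Example with the path and account used in this document:
#   write_my_cnf /usr/local/prometheus/mysqld_exporter/.my.cnf exporter 'XXXXXXXX'
```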
Adding the service
See mysql_exporter.service in Appendix code.
Related references
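Configuring Grafana
Installing Grafana
Download the RPM package and install it directly with yum.
Adding the Prometheus data source
Open the Grafana web UI and add Prometheus as a data source.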
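The data source can also be created through Grafana's HTTP API instead of the web UI. The sketch below only builds the request body; the commented usage shows how it might be sent. The helper name datasource_payload, the data source name "Prometheus", the default Grafana port 3000, and the admin account are assumptions for this example.

```shell
# Print the JSON body for Grafana's "create data source" API call.
# Usage: datasource_payload <prometheus-url>
datasource_payload() {
    printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1"
}

# Example (replace <password> with the real admin password):
#   datasource_payload http://172.50.13.101:19090 |
#     curl -fsS -u 'admin:<password>' -H 'Content-Type: application/json' \
#          -X POST --data @- http://localhost:3000/api/datasources
```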
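Importing dashboard templates
MySQL dashboard ID: 7362
Node dashboard ID: 13105
Pitfalls
- Problem 1: after adding the MySQL exporter, the MySQL metrics are visible in the Prometheus web UI, but the Prometheus data source cannot be added in Grafana
  - Cause: the dashboard is not adapted to Prometheus
  - Solution:
    - Try a different MySQL dashboard ID
    - Create the panels yourself
- Problem 2: some panels of the MySQL dashboard in Grafana show no data
  - Solution: make the instance label of node_exporter and mysqld_exporter identical
Official dashboard resources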
Alerting
- The alerting service runs on 172.50.13.102
- Deployment location: /usr/local/prometheus/alertmanager/
- Data storage path: /home/data/prometheus/alertmanager/
- Listening port: 18081
Configuring Alertmanager
./alertmanager --storage.path=/home/data/prometheus/alertmanager/ --web.listen-address=:18081 --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --data.retention=120h --web.external-url=http://172.50.13.102:18081 &
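- storage.path: data storage path
- web.listen-address: listening address and port
- config.file: path to the configuration file
- data.retention: how long data is retained
- web.external-url: URL under which the web UI is reachable
Configuring Prometheus alerting rules
- Connect Prometheus to Alertmanager: see alerting in prometheus.yml
- Set the path of the alerting rule files: see rule_files in prometheus.yml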
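Before reloading Prometheus with new rules, the configuration can be validated with promtool, which ships in the same tarball as the prometheus binary. The sketch below is one way to wrap that check. The helper name check_prometheus_config is hypothetical, and it assumes promtool and prometheus.yml sit in the same directory.

```shell
# Validate prometheus.yml (and the rule files it references) with promtool.
# Usage: check_prometheus_config <install-dir>
check_prometheus_config() {
    if [ ! -x "$1/promtool" ]; then
        echo "promtool not found in $1" >&2
        return 1
    fi
    "$1/promtool" check config "$1/prometheus.yml"
}

# Example on the server node:
#   check_prometheus_config /usr/local/prometheus
```

A non-zero exit status means the configuration or one of the rule files failed validation and should not be reloaded yet.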
Configuring the DingTalk alert robot
/usr/local/prometheus/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --ding.profile='webhook1=https://oapi.dingtalk.com/robot/send?access_token=<DingTalk token>' &
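To check the whole chain from Alertmanager to DingTalk without waiting for a real incident, a synthetic alert can be posted to Alertmanager's v2 API. The sketch below only builds the JSON body; the commented usage shows how it might be sent. The helper name test_alert_payload and the label and annotation values are made up for this example.

```shell
# Print a one-element JSON array describing a synthetic alert.
# Usage: test_alert_payload <alertname> <instance>
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},' "$1" "$2"
    printf '"annotations":{"summary":"test alert, please ignore"}}]\n'
}

# Example against the Alertmanager address used in this document:
#   test_alert_payload manual_test test-host |
#     curl -fsS -H 'Content-Type: application/json' \
#          -X POST --data @- http://172.50.13.102:18081/api/v2/alerts
```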
Appendix code
prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.50.13.102:18081']
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alertrules/*_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:19090']
  - job_name: '非生产' # non-production servers
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_configs/noGroup*.yml']
        refresh_interval: 10s
  - job_name: '生产mysql' # production MySQL
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_configs/mysql/product*.yml']
        refresh_interval: 10s
  - job_name: '非生产mysql' # non-production MySQL
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_configs/mysql/noproduct*.yml']
        refresh_interval: 10s
  - job_name: '生产服务器' # production servers
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_configs/product*.yml']
        refresh_interval: 10s
  - job_name: '物理机' # physical machines
    file_sd_configs:
      - files: ['/usr/local/prometheus/sd_configs/wuli*.yml']
        refresh_interval: 10s
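The scrape jobs above read their targets from files under /usr/local/prometheus/sd_configs/, but the document does not show such a file. The sketch below generates one in the YAML format that file_sd_configs accepts. The helper name write_sd_file, the env label, the example file name, and the example address are all assumptions made for illustration.

```shell
# Write a file_sd target file listing one or more host:port targets.
# Usage: write_sd_file <path> <env-label> <target> [<target> ...]
write_sd_file() {
    path="$1"
    env_label="$2"
    shift 2
    {
        printf '%s\n' "- targets:"
        for t in "$@"; do
            printf "    - '%s'\n" "$t"
        done
        printf '%s\n' "  labels:"
        printf "    env: '%s'\n" "$env_label"
    } > "$path"
}

# Example (the address is a made-up placeholder for a monitored node):
#   write_sd_file /usr/local/prometheus/sd_configs/product-web.yml production 172.50.13.103:18080
```

Prometheus re-reads matching files on its own, so adding a host this way does not require a reload.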
alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: webhook
receivers:
  - name: webhook
    webhook_configs:
      - url: 'http://172.50.13.102:8060/dingtalk/webhook1/send'
        send_resolved: true
prometheus.service
[Unit]
Description=https://prometheus.io
Documentation=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus \
    --storage.tsdb.path=/home/prometheus/data/ \
    --config.file=/usr/local/prometheus/prometheus.yml \
    --web.listen-address=:19090 \
    --storage.tsdb.retention.time=15d
[Install]
WantedBy=multi-user.target
node_exporter.service
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/node_exporter/node_exporter --web.listen-address 0.0.0.0:18080
[Install]
WantedBy=multi-user.target
mysql_exporter.service
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/mysqld_exporter/mysqld_exporter \
--web.listen-address 0.0.0.0:9104 \
--config.my-cnf=/usr/local/prometheus/mysqld_exporter/.my.cnf
[Install]
WantedBy=multi-user.target
alertmanager.service
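The original post leaves this unit file empty. The sketch below writes one possible version, modeled on the other units in this appendix and on the Alertmanager command line shown earlier. The helper name write_alertmanager_unit and the Description text are assumptions; the flags and paths are the ones used in this document.

```shell
# Write a systemd unit for Alertmanager to the given path.
# Usage: write_alertmanager_unit <unit-file-path>
write_alertmanager_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
    --storage.path=/home/data/prometheus/alertmanager/ \
    --web.listen-address=:18081 \
    --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml \
    --data.retention=120h \
    --web.external-url=http://172.50.13.102:18081
[Install]
WantedBy=multi-user.target
EOF
}

# Example:
#   write_alertmanager_unit /usr/lib/systemd/system/alertmanager.service
#   systemctl daemon-reload && systemctl enable --now alertmanager
```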
Alerting rules: memory, disk, CPU
groups:
  - name: mem
    rules:
      - alert: mem
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }}: memory usage has been above 95% for 5 minutes!"
          summary: "{{ $labels.instance }} memory usage too high!"
  - name: disk
    rules:
      - alert: disk
        expr: (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}) *100/(node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}+(node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint !~".*pod.*"})) > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }}: disk usage has been above 95% for 5 minutes!"
          summary: "{{ $labels.instance }} disk usage is above 95%!"
  - name: cpu
    rules:
      - alert: cpu
        expr: ((1- sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)/sum(increase(node_cpu_seconds_total[5m])) by (instance)) * 100) > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }}: CPU usage has been above 70% for 5 minutes!"
          summary: "{{ $labels.instance }} CPU usage is above 70%!"
Alerting rules: host liveness
groups:
  - name: UP
    rules:
      - alert: node
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }}: the node has been unreachable for more than 1 minute!"
          summary: "{{ $labels.instance }} down"
This article comes from cnblogs (博客园). Author: 花酒锄作田. When reposting, please cite the original link: https://www.cnblogs.com/XY-Heruo/p/14498541.html