Prometheus监控zookeeper集群(1)
因为zookeeper版本较低为3.4.x版本,所有采用zookeeper_exporter方式采集数据
1.下载(zookeeper_exporter采集器)
https://github.com/carlpett/zookeeper_exporter/releases/download/v1.1.0/zookeeper_exporter
2. 传到liunx上/opt目录下,没有目录可以自行创建
3.授予权限
chmod 755 zookeeper_exporter
4.编写zookeeper_exporter监控脚本(集群每台都跑)
vim /lib/systemd/system/zkexporter.service
[Unit]
Description=zookeeper_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/zookeeper_exporter -zookeeper 10.249.0.63:2181 -bind-addr :9143
Restart=on-failure
[Install]
WantedBy=multi-user.target
5.分别执行如下启动命令
systemctl start zkexporter.service
systemctl status zkexporter.service
6.查看zookeeper_exporter运行状态(如出现Active: active (running) 已经运行成功)
7.查看采集数据
curl localhost:9143/metrics
8.修改 Prometheus 的配置文件 (prometheus.yml)
9.重启Prometheus ,访问http://localhost:9090
如上所示,当 State 状态显示为 UP 时,则说明 zookeeper_exporter 服务已经集成进来了
10.rule告警文件(仅供参考):
groups:
- name: zookeeperStatsAlert
rules:
- alert: 堆积请求数过大
expr: avg(zk_outstanding_requests) by (instance) > 10
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} "
description: "积请求数过大"
- alert: 阻塞中的 sync 过多
expr: avg(zk_pending_syncs) by (instance) > 10
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} "
description: "塞中的 sync 过多"
- alert: 平均响应延迟过高
expr: avg(zk_avg_latency) by (instance) > 10
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} "
description: '平均响应延迟过高'
- alert: 打开文件描述符数大于系统设定的大小
expr: zk_open_file_descriptor_count > zk_max_file_descriptor_count * 0.85
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} "
description: '打开文件描述符数大于系统设定的大小'
- alert: zookeeper服务器宕机
expr: up{job="prd_zookeeper"} == 0
for: 5s
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} "
description: 'zookeeper服务器宕机'
- alert: zk主节点丢失
expr: absent(zk_server_state{state="leader"}) != 1
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} "
description: 'zk主节点丢失'
11.配置grafana
grafanaid: 11442