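Deploying Grafana, Alertmanager, Prometheus, webhook and Pushgateway with Docker Compose for metrics monitoring
This setup provides: Grafana dashboards, Prometheus monitoring and alerting, and DingTalk alert notifications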
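1. What each service does
- Prometheus: an open-source system monitoring and alerting framework, inspired by Google's Borgmon monitoring system
- AlertManager: handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, routes them to the correct receiver integration, and also handles silencing and inhibition of alerts
- Node_Exporter: an exporter that monitors the resource information of each node; deploy it on every node that Prometheus monitors
- cAdvisor: monitors containers
- mysqld-exporter: collects MySQL data
- pushgateway: data is pushed to the pushgateway component with curl, and the Prometheus component pulls it from there
- prometheus-webhook-dingtalk: DingTalk alerting plugin
- grafana: monitoring visualization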
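Simple topology diagram
2. Create the prometheus directory to hold all monitoring files; host information
There is only one server, 10.1.1.10, and it runs every service. To monitor more hosts, add a job to the configuration file and start a Node_Exporter service on each monitored host.
mkdir -p /data/prometheus #all operations below are run inside the prometheus directory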
cd /data/prometheus
3. Create the Prometheus configuration file and data directory, read when Prometheus starts
mkdir ./prometheus/data -p #create the directory that stores Prometheus data
chmod 777 ./prometheus/data
vim ./prometheus/prometheus.yml
global:
  scrape_interval: 15s      # how often to scrape data
  evaluation_interval: 15s  # how often to evaluate rules
  scrape_timeout: 10s       # timeout for each scrape

# scrape configuration list
scrape_configs:
  - job_name: prometheus    # required; attached automatically as the job label, must be unique
    static_configs:
      - targets: ['10.1.1.10:9090']   # Prometheus IP and port
        labels:
          instance: prometheus        # label
  - job_name: 1.1.1.1-node1           # node-exporter
    static_configs:
      - targets: ['10.1.1.10:9100']
        labels:
          instance: 1.1.1.1           # same label on both jobs, so node and cadvisor data are collected together
  - job_name: 1.1.1.1-node2           # cadvisor
    static_configs:
      - targets: ['10.1.1.10:9200']
        labels:
          instance: 1.1.1.1

alerting:                   # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - 10.1.1.10:9093   # the alerting component

rule_files:                 # alert rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"
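Monitoring another host only needs one more job under scrape_configs, as noted in section 2. The following is a minimal sketch, assuming a second host 10.1.1.11 that runs node-exporter on port 9100; both the host and the helper name node_job are made up for illustration and are not part of Prometheus.

```shell
#!/bin/bash
# node_job is a hypothetical helper: it prints one scrape job for prometheus.yml.
#   $1 = job name, $2 = host:port of the exporter, $3 = instance label
node_job() {
  cat <<EOF
  - job_name: $1
    static_configs:
      - targets: ['$2']
        labels:
          instance: $3
EOF
}

# 10.1.1.11 is an assumed example host; paste the output under scrape_configs.
node_job 10.1.1.11-node1 10.1.1.11:9100 10.1.1.11
```

Prometheus has to reload its configuration after the edit, for example by restarting the container.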
4. Create the alert rule and trigger condition files, which the Prometheus configuration file loads
4.1:
mkdir rules #create the rules directory first
vim rules/alert-rules.yml #general-purpose rules
groups:
  - name: prometheus-alert
    rules:
      - alert: node-down
        # prometheus:up is commented out in record-rules.yml, so query up directly
        expr: up{job!="prometheus"} == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: cpu-high
        expr: prometheus:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage is high: {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has stayed above 80% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: cpu-iowait-high
        expr: prometheus:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait is high: {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has stayed above 12% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: load-load1-high
        # prometheus:cpu:count is not defined in record-rules.yml; add a recording rule for it or this alert never fires
        expr: (prometheus:load:load1) > (prometheus:cpu:count) * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: memory-high
        expr: prometheus:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} memory usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: disk-high
        expr: prometheus:disk:used:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: disk-read-count-high
        expr: prometheus:disk:read:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: disk-write-count-high
        expr: prometheus:disk:write:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: disk-read-mb-high
        expr: prometheus:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk read MB/s is high: {{ $value }}"
          description: ""
          instance: "{{ $labels.instance }}"
          value: "{{ $value }}"
      - alert: disk-write-mb-high
        expr: prometheus:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk write MB/s is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: filefd-allocated-percent-high
        expr: prometheus:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptor usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: network-netin-error-rate-high
        expr: prometheus:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: network-netin-packet-rate-high
        expr: prometheus:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: network-netout-packet-rate-high
        expr: prometheus:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: network-tcp-total-count-high
        expr: prometheus:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: process-zoom-total-count-high
        # prometheus:process:zoom:total:count is not defined in record-rules.yml; add a recording rule for it or this alert never fires
        expr: prometheus:process:zoom:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: time-offset-high
        # prometheus:time:offset is not defined in record-rules.yml; add a recording rule for it or this alert never fires
        expr: prometheus:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
vim rules/record-rules.yml
groups:
  - name: prometheus-record
    rules:
      # - expr: up{job!="prometheus"} == 0
      #   record: prometheus:up
      #   labels:
      #     desc: "whether the node is online: 1 online, 0 offline"
      #     unit: " "
      #     job: "prometheus"
      - expr: time() - node_boot_time_seconds{}
        record: prometheus:node_uptime
        labels:
          desc: "node uptime"
          unit: "s"
          job: "prometheus"
      ########################################
      # cpu #
      - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:total:percent
        labels:
          desc: "node total CPU usage percentage"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:idle:percent
        labels:
          desc: "node CPU idle percentage"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m]))) * 100
        record: prometheus:cpu:iowait:percent
        labels:
          desc: "node CPU iowait percentage"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m]))) * 100
        record: prometheus:cpu:system:percent
        labels:
          desc: "node CPU system percentage"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m]))) * 100
        record: prometheus:cpu:user:percent
        labels:
          desc: "node CPU user percentage"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
        record: prometheus:cpu:other:percent
        labels:
          desc: "node CPU other percentage"
          unit: "%"
          job: "prometheus"
      ########################################
      # memory #
      - expr: node_memory_MemTotal_bytes{job!="prometheus"}
        record: prometheus:memory:total
        labels:
          desc: "node total memory"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:free
        labels:
          desc: "node free memory"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:used
        labels:
          desc: "node used memory"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
        record: prometheus:memory:actualused
        labels:
          desc: "memory actually used by user processes on the node"
          unit: byte
          job: "prometheus"
      - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:used:percent
        labels:
          desc: "node memory usage percentage"
          unit: "%"
          job: "prometheus"
      - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:free:percent
        labels:
          desc: "node free memory percentage"
          unit: "%"
          job: "prometheus"
      ########################################
      # load #
      - expr: sum by (instance) (node_load1{job!="prometheus"})
        record: prometheus:load:load1
        labels:
          desc: "1-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load5{job!="prometheus"})
        record: prometheus:load:load5
        labels:
          desc: "5-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load15{job!="prometheus"})
        record: prometheus:load:load15
        labels:
          desc: "15-minute system load"
          unit: " "
          job: "prometheus"
      ########################################
      # disk #
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:total
        labels:
          desc: "node total disk size"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:free
        labels:
          desc: "node free disk space"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:used
        labels:
          desc: "node used disk space"
          unit: byte
          job: "prometheus"
      - expr: (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:disk:used:percent
        labels:
          desc: "node disk usage percentage"
          unit: "%"
          job: "prometheus"
      - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:read:count:rate
        labels:
          desc: "node disk read rate"
          unit: "ops/s"
          job: "prometheus"
      - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:write:count:rate
        labels:
          desc: "node disk write rate"
          unit: "ops/s"
          job: "prometheus"
      # read and write metrics were swapped in the original; read uses read_bytes, write uses written_bytes
      - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:read:mb:rate
        labels:
          desc: "node device read rate in MB"
          unit: "MB/s"
          job: "prometheus"
      - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:write:mb:rate
        labels:
          desc: "node device write rate in MB"
          unit: "MB/s"
          job: "prometheus"
      ########################################
      # filesystem #
      - expr: (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:filesystem:used:percent
        labels:
          desc: "node inode usage percentage"
          unit: "%"
          job: "prometheus"
      ########################################
      # filefd #
      - expr: node_filefd_allocated{job!="prometheus"}
        record: prometheus:filefd_allocated:count
        labels:
          desc: "number of open file descriptors on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
        record: prometheus:filefd_allocated:percent
        labels:
          desc: "percentage of file descriptors in use on the node"
          unit: "%"
          job: "prometheus"
      ########################################
      # network #
      # the source metrics are in bytes, so multiply by 8 to record bit/s
      - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netin:bit:rate
        labels:
          desc: "bits received per second by the node's NIC"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netout:bit:rate
        labels:
          desc: "bits sent per second by the node's NIC"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:packet:rate
        labels:
          desc: "packets received per second by the node's NIC"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:packet:rate
        labels:
          desc: "packets sent per second by the node's NIC"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:error:rate
        labels:
          desc: "receive errors detected by the node's device driver"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:error:rate
        labels:
          desc: "transmit errors detected by the node's device driver"
          unit: "packets/s"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="established"}
        record: prometheus:network:tcp:established:count
        labels:
          desc: "number of established connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
        record: prometheus:network:tcp:timewait:count
        labels:
          desc: "number of time_wait connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
        record: prometheus:network:tcp:total:count
        labels:
          desc: "total number of TCP connections on the node"
          unit: "count"
          job: "prometheus"
5. Create the Grafana data directory and configuration file, used by Grafana to store its data
mkdir grafana/grafana-storage -p
chmod 777 grafana/grafana-storage
The grafana.ini configuration file can be copied out of a Grafana container.
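One way to copy the file out is to run a throwaway container and print its default configuration. This is only a sketch: it assumes the grafana/grafana image, whose default configuration file is /etc/grafana/grafana.ini, and the helper name copy_grafana_ini is made up for illustration.

```shell
#!/bin/bash
# copy_grafana_ini is a hypothetical helper: it writes the image's default
# grafana.ini to the path given as $1.
# Assumption: the grafana/grafana image keeps it at /etc/grafana/grafana.ini.
copy_grafana_ini() {
  docker run --rm --entrypoint cat grafana/grafana /etc/grafana/grafana.ini > "$1"
}

# Only attempt the copy when docker is installed.
if command -v docker >/dev/null 2>&1; then
  copy_grafana_ini ./grafana/grafana.ini ||
    echo "copy failed; check docker and the image name" >&2
fi
```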
6. Create the Alertmanager configuration, used to send alerts to the webhook
mkdir alert
vim alert/alertmanager.yml
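global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s       # wait this long before the first notification of a new group, so alerts arriving within 30s are sent together
  group_interval: 5m    # wait 5m before notifying about new alerts added to a group that has already been notified
  repeat_interval: 5m   # resend interval: if the alert is not fixed within this time, it is sent again
  group_by: [alertname]
  routes:
    - receiver: webhook
      group_wait: 10s
receivers:
  - name: webhook
    webhook_configs:
      - url: http://10.1.1.10:8060/dingtalk/webhook1/send
        send_resolved: true
The url points to the webhook's address.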
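To check the route without waiting for a real rule to fire, a test alert can be posted to Alertmanager's v2 API (POST /api/v2/alerts). The sketch below only prints the payload unless ALERTMANAGER_URL is set; the helper name alert_payload, the alert name manual-test and the label values are assumptions chosen for illustration.

```shell
#!/bin/bash
# alert_payload is a hypothetical helper: it prints a one-alert JSON array in
# the shape accepted by Alertmanager's POST /api/v2/alerts endpoint.
#   $1 = alertname, $2 = instance label
alert_payload() {
  cat <<EOF
[
  {
    "labels": {"alertname": "$1", "instance": "$2", "severity": "info"},
    "annotations": {"summary": "manual test alert for $2"}
  }
]
EOF
}

# Print the payload; send it only when ALERTMANAGER_URL is set,
# e.g. ALERTMANAGER_URL=http://10.1.1.10:9093
alert_payload manual-test 10.1.1.10
if [ -n "${ALERTMANAGER_URL:-}" ]; then
  alert_payload manual-test 10.1.1.10 |
    curl -s -H "Content-Type: application/json" --data-binary @- \
      "$ALERTMANAGER_URL/api/v2/alerts"
fi
```

If the route works, the alert should reach the webhook receiver once the configured group_wait has passed.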
7. webhook configuration file
mkdir ./webhook
vim ./webhook/dingding.tmpl
[root@prometheus ~]# cat dingding.tmpl
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
**====Fault detected====**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Resolved list:
{{ template "__text_resolve_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
8. Write the docker-compose yml that starts the services
vim docker-compose.yml
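The contents of docker-compose.yml are not shown here, so the following is only a sketch of what the file could look like for the directories created in the earlier steps. The image names, the container paths and the helper name compose_yaml are assumptions based on each image's defaults, not the author's original file; the DingTalk webhook settings are left out because they depend on the image version.

```shell
#!/bin/bash
# compose_yaml is a hypothetical helper that prints a sketch of docker-compose.yml.
# Image names and container paths are assumptions based on each image's defaults.
compose_yaml() {
  cat <<'EOF'
version: "3"
services:
  prometheus:
    image: prom/prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/data:/prometheus
      - ./rules:/etc/prometheus/rules
  alertmanager:
    image: prom/alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml
  grafana:
    image: grafana/grafana
    restart: always
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/grafana-storage:/var/lib/grafana
  pushgateway:
    image: prom/pushgateway
    restart: always
    ports:
      - "9091:9091"
  webhook-dingtalk:
    # The DingTalk token and dingding.tmpl wiring depend on the image version
    # and are intentionally not shown here.
    image: timonwong/prometheus-webhook-dingtalk
    restart: always
    ports:
      - "8060:8060"
EOF
}

# Write the sketch to the file used by the start command.
compose_yaml > docker-compose.yml
```

The published ports match the addresses used elsewhere in this article: 9090 for Prometheus, 9093 for Alertmanager, 8060 for the webhook, 9091 for Pushgateway and 3000 for Grafana.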
Start:
docker-compose -f docker-compose.yml up -d
9. Create node-exporter-compose.yml to start the collector services
vim node-exporter-compose.yml
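The contents of node-exporter-compose.yml are not shown either, so here is a sketch under stated assumptions: node-exporter on the host network (port 9100), cAdvisor published on port 9200 to match the scrape target in prometheus.yml, and mysqld-exporter reading the my.cnf shown in this section. Images, mounts, the container path of my.cnf and the helper name exporter_compose_yaml are assumptions, loosely following each project's README.

```shell
#!/bin/bash
# exporter_compose_yaml is a hypothetical helper that prints a sketch of
# node-exporter-compose.yml. Images, mounts and flags are assumptions.
exporter_compose_yaml() {
  cat <<'EOF'
version: "3"
services:
  node-exporter:
    image: prom/node-exporter
    restart: always
    network_mode: host        # serves metrics on host port 9100
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - "--path.rootfs=/host"
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    restart: always
    ports:
      - "9200:8080"           # 9200 matches the cadvisor target in prometheus.yml
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  mysqld-exporter:
    image: prom/mysqld-exporter
    restart: always
    ports:
      - "9104:9104"
    volumes:
      - ./my.cnf:/etc/mysqld-exporter/my.cnf:ro   # container path is an assumption
    command:
      - "--config.my-cnf=/etc/mysqld-exporter/my.cnf"
EOF
}

# Write the sketch to the file used by the start command.
exporter_compose_yaml > node-exporter-compose.yml
```

If mysqld-exporter is used, Prometheus also needs a scrape job for port 9104; drop the service on hosts without MySQL.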
docker-compose -f node-exporter-compose.yml up -d
Create one copy for each host you add; the local host works too.
my.cnf
[client]
user=exporter #a MySQL user created for mysqld-exporter to log in with
password=Aa123456 #set the password; use a strong one
host=192.168.1.1 #database IP
port=3307 #database port
#SQL statements that create the database user:
#CREATE USER 'exporter'@'%' IDENTIFIED BY 'Aa123456';
#GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'exporter'@'%';
10. Push node-exporter metrics to pushgateway (other exporters work the same way)
curl 127.0.0.1:9100/metrics|curl --data-binary @- http://xxx.com:9091/metrics/job/test/instance/10.2.1.11/hostname/ip-10-2-1-11
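Script:
#!/bin/bash
job_name="test"
hostname=$(hostname)
HOST_IP=$(hostname --all-ip-addresses | awk '{print $1}')   # first IP address of this host
# pull the metrics from node-exporter and push them to pushgateway
/usr/bin/curl -s 127.0.0.1:9100/metrics | /usr/bin/curl --data-binary @- "http://xxxx.com:9091/metrics/job/$job_name/instance/$HOST_IP/hostname/$hostname"
Remember to configure Prometheus to scrape pushgateway in its configuration file.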
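The scrape job for pushgateway could look like the sketch below. It assumes pushgateway listens on 10.1.1.10:9091 (the monitoring server and the port used in the push URL); the helper name pushgateway_job is made up for illustration. honor_labels: true keeps the job and instance labels that were set in the push URL instead of overwriting them with the scrape job's own.

```shell
#!/bin/bash
# pushgateway_job is a hypothetical helper: it prints a scrape job for pushgateway.
#   $1 = host:port of the pushgateway
pushgateway_job() {
  cat <<EOF
  - job_name: pushgateway
    honor_labels: true   # keep the job/instance labels that were pushed
    static_configs:
      - targets: ['$1']
EOF
}

# 10.1.1.10:9091 is an assumed address; paste the output under scrape_configs.
pushgateway_job 10.1.1.10:9091
```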
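11. Checks
docker ps -a #check that the containers are running
netstat -nltp #check that the ports are listening
Open ip:9090 in a browser
Prometheus reads the local time
Manual alert test; replace the token with your own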
curl -H "Content-Type: application/json" -d '{"msgtype":"text","text":{"content":"prometheus alert test"}}' 'https://oapi.dingtalk.com/robot/send?access_token=***'
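12. Configure Grafana at ip:3000; the default username and password are admin/admin
Result (see below for where to download the templates; IDs 16314 and 1860 can be used)
#just download a monitoring template from the official site
Plugin address:
That completes the deployment. Thanks for reading; please credit this article when reposting.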