Configuring DingTalk Alerting for Prometheus Operator in a Kubernetes Cluster
This post follows 《Kubernetes环境使用Prometheus Operator自发现监控SpringBoot》. With data collection for the various monitoring targets and the Grafana dashboards all verified, the next step is wiring up alerting. Alertmanager has no native DingTalk support, so notifications have to go through a webhook. Fortunately, an open-source DingTalk webhook for Prometheus already exists (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we can simply configure and use it.
Creating a DingTalk robot is straightforward, so it is not covered here. Once the robot exists, the next step is to deploy the webhook, which receives alert notifications from Alertmanager, formats them, and forwards them to the DingTalk robot. Outside a Kubernetes cluster, deployment is equally simple: write a docker-compose file and bring it up, as sketched below.
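For reference, a minimal docker-compose.yml for a non-Kubernetes deployment might look like this sketch (the ./config host directory holding config.yaml and template.tmpl is an assumption; both files are shown in section 1.1 below):

version: "3"
services:
  dingtalk-webhook:
    image: timonwong/prometheus-webhook-dingtalk:v1.4.0
    restart: always
    ports:
    - "8060:8060"
    volumes:
    # config.yaml and template.tmpl are kept in ./config on the host
    - ./config:/config
    command:
    - '--web.enable-ui'
    - '--config.file=/config/config.yaml'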
1. Inside a Kubernetes cluster, pods talk to each other through a Service, so first write a Kubernetes manifest, dingtalk-webhook.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - mountPath: "/config"
          name: dingtalk-volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-volume
        persistentVolumeClaim:
          claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
1.1. Option 1: persistent storage. Put the config file config.yaml and the alert template on shared storage so the webhook can read them no matter which node it is scheduled on. For provisioning NFS-backed PVs dynamically via a StorageClass, see 《Kubernetes使用StorageClass动态生成NFS类型的PV》.
dingding-pvc.yaml
# cat dingding-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "nfs-client"
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
The configuration file config.yaml:
templates:
- /config/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=<your DingTalk robot access_token>
The alert template template.tmpl:
{{ define "ding.link.title" }}[Monitoring Alert]{{ end }}
{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range $i, $alert := .Alerts.Firing }}
[Alert Name]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Severity]: {{ index $alert.Labels "severity" }}
[Threshold]: {{ index $alert.Annotations "value" }}
[Details]: {{ index $alert.Annotations "description" }}
[Triggered At]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range $i, $alert := .Alerts.Resolved }}
[Alert Name]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Status]: Resolved
[Started At]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
[Resolved At]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- end }}
Feel free to adapt the template to your taste. `.StartsAt.Add 28800e9` adds 28800e9 nanoseconds (8 hours) to the timestamp, because Prometheus and Alertmanager both use UTC by default. Also set the owner and group of both files to 65534 (the nobody user the container runs as); otherwise the webhook container has no permission to read them.
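For example, if the PVC is backed by NFS, run something like this on the NFS server (the export path below is hypothetical; substitute the directory that nfs-client provisioned for dingding-pvc):

# fix ownership so the webhook container (uid/gid 65534) can read the files
chown -R 65534:65534 /nfs/data/monitoring-dingding-pvc-<volume-id>/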
1.2. Option 2 (recommended): mount the configuration file and template from a ConfigMap. This requires modifying the original dingtalk-webhook.yaml to add the ConfigMap mount:
[root@master tmp]# cat dingtalk-webhook.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
    - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=d6fab51d798f81b10a464fd232d4d3ec6d2aa9ed2df3dd013c659d7c11d946ff  # your DingTalk robot URL
  template.tmpl: |
    {{ define "ding.link.title" }}[Monitoring Alert]{{ end }}
    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range $i, $alert := .Alerts.Firing }}
    [Alert Name]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Threshold]: {{ index $alert.Annotations "value" }}
    [Details]: {{ index $alert.Annotations "description" }}
    [Triggered At]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range $i, $alert := .Alerts.Resolved }}
    [Alert Name]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Status]: Resolved
    [Started At]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [Resolved At]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingding-webhook
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - name: dingtalk-config
          mountPath: "/config"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-config
        configMap:
          name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  name: dingding-webhook  # referenced by the url in alertmanager-secret.yaml below ("http://dingding-webhook/dingtalk/webhook1/send")
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
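Because the container is started with --web.enable-lifecycle, the webhook can in principle be hot-reloaded after you edit the ConfigMap, without restarting the pod. Keep in mind that kubelet may take up to a minute or so to project an updated ConfigMap into the pod. A sketch (run the curl from a pod inside the cluster; if the /-/reload endpoint is not available in your image version, simply delete the pod and let the Deployment recreate it):

kubectl apply -f dingtalk-webhook.yaml
# wait for kubelet to sync the updated ConfigMap into the pod, then:
curl -X POST http://dingding-webhook.monitoring/-/reload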
2. Modify Alertmanager's default configuration to add webhook_configs, by editing kube-prometheus-master/manifests/alertmanager-secret.yaml to the following:
cat alertmanager-secret.yaml
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "www.amd5.cn"
      "webhook_configs":
      - "url": "http://dingding-webhook/dingtalk/webhook1/send"  # note: must match the Service name (dingding-webhook) defined in dingtalk-webhook.yaml above
        "send_resolved": true
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "www.amd5.cn"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"
Once all the YAML files are ready, apply them:
kubectl apply -f dingding-pvc.yaml
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml
Check the result:
[root@master tmp]# kubectl get po,svc -n monitoring | grep dingding-webhook
pod/dingding-webhook-6f765c6c59-ql9v7   1/1   Running   0   17m
service/dingding-webhook   ClusterIP   10.107.133.148   <none>   80/TCP   18m
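You can also confirm what configuration the Operator handed to Alertmanager by decoding the secret:

kubectl -n monitoring get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d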
Then open the Alertmanager status page at http://alertmanager.amd5.cn/#/status (replace alertmanager.amd5.cn with your own address) and check that the webhook_configs entry has taken effect.
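Before waiting for a real rule to fire, you can also smoke-test the whole notification path by posting a hand-crafted Alertmanager-style payload straight to the webhook (a sketch with made-up label values; run it from a pod inside the cluster, since dingding-webhook is a ClusterIP service). If the token and template are correct, the DingTalk group should receive a message right away:

curl -s -X POST http://dingding-webhook.monitoring/dingtalk/webhook1/send \
  -H 'Content-Type: application/json' \
  -d '{
        "version": "4",
        "status": "firing",
        "receiver": "webhook1",
        "groupLabels": {}, "commonLabels": {}, "commonAnnotations": {},
        "externalURL": "http://alertmanager.amd5.cn",
        "alerts": [{
          "status": "firing",
          "labels": {"alertname": "WebhookSmokeTest", "instance": "test-node", "severity": "warning"},
          "annotations": {"description": "manual webhook test", "value": "1"},
          "startsAt": "2024-03-28T08:00:00Z",
          "endsAt": "0001-01-01T00:00:00Z"
        }]
      }'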
3. Once the configuration is live, add alerting rules and wait for a threshold to be crossed to test the alerts.
Edit prometheus-rules.yaml and append the following at the end (watch the indentation):
[root@master tmp]# cat prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: node-status-alerts
    rules:
    - alert: NodeMemoryUsage
      expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 85
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Memory usage too high!"
        description: "Memory usage on node {{$labels.instance}} is above 85% (current: {{$value}}%)"
    - alert: NodeTCPSessions
      expr: node_netstat_Tcp_CurrEstab > 1000
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Too many TCP_ESTABLISHED connections!"
        description: "{{$labels.instance}} has more than 1000 TCP connections in ESTABLISHED state"
    - alert: NodeDiskUsage
      expr: |
        max(
          (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_free_bytes{fstype=~"ext.?|xfs"}) * 100
          / (node_filesystem_avail_bytes{fstype=~"ext.?|xfs"} + (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_free_bytes{fstype=~"ext.?|xfs"}))
        ) by (instance) > 85
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Disk partition usage too high!"
        description: "Disk partition usage on {{$labels.instance}} is above 85% (current: {{$value}}%)"
    - alert: NodeCPUUsage
      expr: |
        (
          100
          - (
            avg by (instance) (
              irate(node_cpu_seconds_total{job=~".*",mode="idle"}[5m])
            ) * 100
          )
        ) > 85
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Node CPU usage too high!"
        description: "CPU usage on {{$labels.instance}} is above 85% (current: {{$value}}%)"
  # The test alert rule added below
  - name: custom-alert-test
    rules:
    - alert: 'DingTalkAlertTest'
      expr: |
        jvm_threads_live > 140
      for: 1m
      labels:
        severity: 'warning'
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: DingTalk alert test"
        custom: "DingTalk alert test"
        value: "{{$value}}"
Then apply the updated rules:
kubectl apply -f prometheus-rules.yaml
Then open Prometheus at http://prometheus.amd5.cn/alerts to check that the rules have taken effect.
Since there is no Java application in my cluster (so jvm_threads_live returns no data), I changed the test alert to a CPU alert instead.
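A replacement test rule along these lines would do (a sketch, not necessarily the exact rule used; the deliberately low 0.1% threshold is only there to force the alert to fire for testing, so do not leave it in production):

  - name: custom-alert-test
    rules:
    - alert: DingTalkAlertTest
      expr: |
        (100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 0.1
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: CPU usage test alert"
        value: "{{ $value }}"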
As an aside, inhibition rules (inhibit_rules) deserve a closer look when time allows.
In the example at the end of this section:
- source_match matches the source alerts, i.e. the alerts that trigger the inhibition.
- target_match matches the target alerts, i.e. the alerts whose notifications are suppressed.
- equal is an optional field listing labels whose values must be identical on the source and target alerts, further narrowing the match.
Note that standard Alertmanager inhibit_rules have no duration field: a target alert is suppressed only for as long as a matching source alert is firing, and its notifications resume once the source alert resolves.
When writing inhibition rules, make sure the source and target match conditions describe exactly the suppression logic you intend, and choose the equal labels carefully so the rule does not accidentally suppress other important notifications.
Note that inhibition rule syntax varies between Alertmanager versions (newer releases prefer source_matchers/target_matchers over the deprecated source_match/target_match), so consult the official documentation for the version you are running.
Finally, add the inhibition rules to the Alertmanager configuration and reload (or restart) Alertmanager for them to take effect; with kube-prometheus, that means updating the alertmanager-main secret shown earlier.
inhibit_rules:
- source_match:    # conditions that identify the source alert
    alertname: "SourceAlert"
    severity: "critical"
  target_match:    # conditions that identify the target alert to suppress
    alertname: "TargetAlert"
    environment: "production"
  equal:           # optional: labels that must have equal values on both alerts
  - "instance"
This article is from cnblogs (博客园), author IT老登. When reposting, please cite the original link: https://www.cnblogs.com/nb-blog/p/18101799