
Configuring DingTalk Alerts for Prometheus Operator in a Kubernetes Cluster

Following the earlier post 《Kubernetes环境使用Prometheus Operator自发现监控SpringBoot》, metric collection for the various monitoring targets and the Grafana dashboards all checked out, so the next step was to integrate alerting. Alertmanager has no native DingTalk support, so notifications have to go through a webhook. Fortunately there is already an open-source webhook for Prometheus DingTalk alerts (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we can simply configure and use it.

Creating a DingTalk robot is straightforward, so it is not covered here. Once the robot exists, the next step is to deploy the webhook, which receives alert notifications from Alertmanager, formats them, and forwards them to the DingTalk robot. Outside of a Kubernetes cluster, deployment is equally simple: write a docker-compose file and run it.
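
For reference, a minimal docker-compose sketch of that non-Kubernetes case could look like the following; the ./config host directory and the published port are assumptions, while the image and flags mirror the Kubernetes Deployment used later in this post.

version: "3"
services:
  dingtalk-webhook:
    image: timonwong/prometheus-webhook-dingtalk:v1.4.0
    restart: always
    ports:
      - "8060:8060"          # the webhook listens on 8060
    volumes:
      - ./config:/config     # assumed local directory holding config.yaml and template.tmpl
    command:
      - '--web.enable-ui'
      - '--web.enable-lifecycle'
      - '--config.file=/config/config.yaml'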

1、In a Kubernetes cluster, Pods talk to each other through a Service, so start by writing a Kubernetes manifest, dingtalk-webhook.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
          - '--web.enable-ui'
          - '--web.enable-lifecycle'
          - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - mountPath: "/config"
          name: dingtalk-volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-volume
        persistentVolumeClaim:
          claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None

1.1、The first approach uses persistent storage: config.yaml and the alert template are kept on shared storage, so the webhook can read them no matter which node it gets scheduled to. For dynamically provisioning NFS-backed volumes, see 《Kubernetes使用StorageClass动态生成NFS类型的PV》.

dingding-pvc.yaml 

# cat dingding-pvc.yaml 
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "nfs-client"
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Mi

Configuration file config.yaml:

templates:
  - /config/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=<your DingTalk robot token>

Alert template template.tmpl:

{{ define "ding.link.title" }}[监控报警]{{ end }}
{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
  {{ range $i, $alert := .Alerts.Firing }}
    [告警项目]:{{ index $alert.Labels "alertname" }}
    [告警实例]:{{ index $alert.Labels "instance" }}
    [告警级别]:{{ index $alert.Labels "severity" }}
    [告警阀值]:{{ index $alert.Annotations "value" }}
    [告警详情]:{{ index $alert.Annotations "description" }}
    [触发时间]:{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
  {{ end }}{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
  {{ range $i, $alert := .Alerts.Resolved }}
    [项目]:{{ index $alert.Labels "alertname" }}
    [实例]:{{ index $alert.Labels "instance" }}
    [状态]:恢复正常
    [开始]:{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [恢复]:{{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
  {{ end }}{{- end }}
{{- end }}

You can adapt the template to taste. The ".StartsAt.Add 28800e9" / ".EndsAt.Add 28800e9" calls add 8 hours (28800e9 nanoseconds) to the timestamps, because Prometheus and Alertmanager both work in UTC. Also set the owner and group of both files to 65534, otherwise the webhook container does not have permission to read them.
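
As a rough sketch of that preparation step (the NFS export path below is an assumed example; use whatever directory the nfs-client provisioner actually created for dingding-pvc):

# on the NFS server: copy the files into the directory backing dingding-pvc
cp config.yaml template.tmpl /data/nfs/monitoring-dingding-pvc-xxxx/
# the webhook runs as uid/gid 65534 (nobody), so hand ownership of the files over
chown -R 65534:65534 /data/nfs/monitoring-dingding-pvc-xxxx/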

1.2、The second approach (recommended) mounts the configuration file and the template from a ConfigMap. This requires modifying the original dingtalk-webhook.yaml to add the ConfigMap and mount it:

[root@master tmp]# cat dingtalk-webhook.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
      - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=d6fab51d798f81b10a464fd232d4d3ec6d2aa9ed2df3dd013c659d7c11d946ff  # DingTalk robot webhook URL
  template.tmpl: |
    {{ define "ding.link.title" }}[监控报警]{{ end }}
    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
      {{ range $i, $alert := .Alerts.Firing }}
        [告警项目]:{{ index $alert.Labels "alertname" }}
        [告警实例]:{{ index $alert.Labels "instance" }}
        [告警级别]:{{ index $alert.Labels "severity" }}
        [告警阀值]:{{ index $alert.Annotations "value" }}
        [告警详情]:{{ index $alert.Annotations "description" }}
        [触发时间]:{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
      {{ range $i, $alert := .Alerts.Resolved }}
        [项目]:{{ index $alert.Labels "alertname" }}
        [实例]:{{ index $alert.Labels "instance" }}
        [状态]:恢复正常
        [开始]:{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        [恢复]:{{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingding-webhook
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
          - '--web.enable-ui'
          - '--web.enable-lifecycle'
          - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - name: dingtalk-config
          mountPath: "/config"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-config
        configMap:
          name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  name: dingding-webhook  # this Service name is used in the webhook url in alertmanager-secret.yaml below ("url": "http://dingding-webhook/dingtalk/webhook1/send")
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
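
A note on updating the configuration later: because the container is started with --web.enable-lifecycle, you can edit the ConfigMap and ask the webhook to reload instead of recreating the Pod. This is only a sketch, under the assumption that the flag exposes a Prometheus-style POST /-/reload endpoint; mounted ConfigMap changes can also take a minute or so to propagate into the Pod before a reload sees them.

kubectl apply -f dingtalk-webhook.yaml                               # push the edited ConfigMap
kubectl -n monitoring port-forward svc/dingding-webhook 8060:80 &    # temporary local access to the Service
curl -X POST http://127.0.0.1:8060/-/reload                          # ask the webhook to re-read /config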

2、Modify Alertmanager's default configuration to add webhook_configs. Edit kube-prometheus-master/manifests/alertmanager-secret.yaml so it contains the following:

# cat alertmanager-secret.yaml
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "www.amd5.cn"
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
      "webhook_configs":
      - "url": "http://dingding-webhook/dingtalk/webhook1/send" #(注意使用上面dingtalk-webhook.yaml 配置中service 的名称dingding-webhook)
        "send_resolved": true 
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "www.amd5.cn"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"

 

Once all the YAML files are ready, apply them:

kubectl apply -f dingding-pvc.yaml 
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml

Check the result:

[root@master tmp]# kubectl get po,svc -n monitoring|grep dingding-webhook
pod/dingding-webhook-6f765c6c59-ql9v7      1/1     Running             0          17m
service/dingding-webhook        ClusterIP   10.107.133.148   <none>        80/TCP                       18m
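
Before relying on Alertmanager, the webhook can be exercised directly with a hand-built payload. This is only a sketch: the JSON below is a minimal approximation of Alertmanager's webhook format and the alert in it is made up purely for the test; if everything is wired up, the DingTalk group should receive a formatted message.

kubectl -n monitoring port-forward svc/dingding-webhook 8060:80 &
# give the port-forward a second to establish, then post the test payload
curl -H "Content-Type: application/json" \
     -d '{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"WebhookTest","instance":"demo","severity":"warning"},"annotations":{"description":"manual webhook test"},"startsAt":"2024-03-28T07:00:00Z"}]}' \
     http://127.0.0.1:8060/dingtalk/webhook1/send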

Then open the Alertmanager status page (replace alertmanager.amd5.cn with your own address), http://alertmanager.amd5.cn/#/status, and confirm that the webhook_configs section has taken effect.
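
If there is no Ingress for Alertmanager yet, the same check can be done from the command line. The sketch below decodes the generated secret and port-forwards the alertmanager-main Service that kube-prometheus creates:

# confirm the webhook_configs made it into the rendered configuration
kubectl -n monitoring get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
# reach the status page locally instead of through an Ingress
kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093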

3、With the webhook in place, add alerting rules and wait for a threshold to be crossed to test the notification.

Create the file prometheus-rules.yaml and add the content below at the end, taking care with the indentation:

[root@master tmp]# cat prometheus-rules.yaml 
apiVersion: monitoring.coreos.com/v1  
kind: PrometheusRule  
metadata:  
  name: prometheus-k8s-rules  
  namespace: monitoring  
spec:  
  groups:  
    - name: 主机状态-监控告警  
      rules:  
        - alert: 节点内存  
          expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 85
          for: 1m  
          labels:  
            severity: warning  
          annotations:  
            summary: "内存使用率过高!"  
            description: "节点{{$labels.instance}}内存使用大于85%(目前使用:{{$value}}%)"  
  
        - alert: 节点TCP会话  
          expr: node_netstat_Tcp_CurrEstab > 1000  
          for: 1m  
          labels:  
            severity: warning  
          annotations:  
            summary: "TCP_ESTABLISHED过高!"  
            description: "{{$labels.instance}} TCP_ESTABLISHED连接数大于1000"  
  
        - alert: 节点磁盘容量  
          expr: |  
            max(  
              (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_free_bytes{fstype=~"ext.?|xfs"})  
              * 100 /  
              (node_filesystem_avail_bytes{fstype=~"ext.?|xfs"} + (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_free_bytes{fstype=~"ext.?|xfs"}))  
            ) by (instance) > 85
          for: 1m  
          labels:  
            severity: warning  
          annotations:  
            summary: "节点磁盘分区使用率过高!"  
            description: "{{$labels.instance}} 磁盘分区使用大于80%(目前使用:{{$value}}%)"  
  
        - alert: 节点CPU  
          expr: |  
            (  
              100 - (  
                avg by (instance) (  
                  irate(node_cpu_seconds_total{job=~".*",mode="idle"}[5m])  
                ) * 100  
              )  
            ) > 85  
          for: 1m  
          labels:  
            severity: warning  
          annotations:  
            summary: "节点CPU��用率过高!"  
            description: "{{$labels.instance}} CPU使用率大于85%(目前使用:{{$value}}%)"  
  
# The test alerting rule added below
    - name: 自定义报警测试  
      rules:  
        - alert: '钉钉报警测试'  
          expr: |  
            jvm_threads_live > 140  
          for: 1m  
          labels:  
            severity: '警告'  
          annotations:  
            summary: "{{ $labels.instance }}: 钉钉报警测试"  
            description: "{{ $labels.instance }}:钉钉报警测试"  
            custom: "钉钉报警测试"  
            value: "{{$value}}"

Then apply the updated rules:

kubectl apply -f prometheus-rules.yaml

Then open Prometheus at http://prometheus.amd5.cn/alerts to check that the rules have taken effect.
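
If Prometheus is not exposed through an Ingress, a quick alternative is to confirm the PrometheusRule object exists and port-forward the prometheus-k8s Service that kube-prometheus creates:

kubectl -n monitoring get prometheusrule prometheus-k8s-rules
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
# then open http://127.0.0.1:9090/alerts in a browser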

Since there are no Java workloads in my cluster (so jvm_threads_live has no data), I changed the test alert to a CPU-based alert instead.
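
For completeness, here is a sketch of such a CPU-based test group, modelled on the node CPU rule earlier in the file but with a deliberately low threshold so it fires quickly; the 1% threshold and the annotation wording are my own choices, not taken from the original rule file:

    - name: 自定义报警测试
      rules:
        - alert: '钉钉报警测试'
          expr: |
            (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 1
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }}: 钉钉报警测试"
            description: "{{ $labels.instance }} CPU使用率大于1%(目前使用:{{$value}}%),仅用于触发钉钉报警"
            value: "{{ $value }}"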

 

As a side note, inhibit rules are worth digging into when time allows.

In this example:

  • source_match matches the source alerts, i.e. the alerts that trigger the inhibition.
  • target_match matches the target alerts, i.e. the alerts being suppressed.
  • equal is an optional field listing labels whose values must be identical on the source and target alerts, further narrowing the match.
  • There is no duration setting: an inhibit rule suppresses matching target alerts only while a matching source alert is firing; once the source alert resolves, the target alerts can notify again.

When writing inhibit rules, make sure the match conditions on the source and target alerts actually describe the suppression logic you intend, and pay particular attention to the equal labels, so the rule does not accidentally suppress other important notifications.

Note that the configuration keys for inhibit rules vary between Alertmanager versions (newer releases use source_matchers/target_matchers instead of source_match/target_match), so consult the documentation for the version you are actually running.

Finally, once the inhibit rules are written, add them to the Alertmanager configuration and reload (or restart) Alertmanager for them to take effect.

inhibit_rules:
- source_match:
    # match conditions for the source alert (the one that triggers the inhibition)
    alertname: "SourceAlert"
    severity: "critical"
  target_match:
    # match conditions for the target alert (the one being suppressed)
    alertname: "TargetAlert"
    environment: "production"
  # optional: labels whose values must be identical on source and target alerts
  equal:
  - "instance"
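
To try this in the setup above, the inhibit_rules block would be merged into the alertmanager.yaml section of alertmanager-secret.yaml (the file from step 2) and the secret re-applied; in a kube-prometheus stack the config-reloader sidecar should then hand the new configuration to Alertmanager without a manual restart.

kubectl apply -f alertmanager-secret.yaml
# then re-check http://alertmanager.amd5.cn/#/status to confirm the new inhibit_rules appear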

 
