prometheus.(8).AlertManager

Prometheus AlertManager

Author's note: The content of this post was accumulated while learning and building this setup. It draws on excellent blog posts and videos from various teachers around the web, modified according to my own understanding. (Because it was pieced together over the course of work and study, I may be unable to provide the original links; my apologies!)

Author's note, again: I am just a humble IT worker hoping to learn from the teachers who provided the original material.

Original source: 大米运维

Starting the Alertmanager service

alert-deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.15.3
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.15.3
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.15.3"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/alertmanager/config.yml
            - --storage.path=/alertmanager/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: alert-config
              mountPath: /etc/alertmanager
            - name: storage-volume
              mountPath: "/alertmanager/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/alertmanager
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: alert-config
              mountPath: /etc/alertmanager
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: alert-config
          configMap:
            name: alert-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager

configmap-reload is a simple binary that triggers a reload when a Kubernetes ConfigMap is updated. It watches the mounted volume directory and notifies the target process that the ConfigMap has changed. Currently it only supports sending an HTTP request; in the future it is expected to also support sending OS signals (e.g. SIGHUP) once Kubernetes supports pod PID namespaces.
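
In this Deployment the sidecar's --webhook-url points at Alertmanager's own reload endpoint, so what it does when the mounted ConfigMap changes is equivalent to calling that endpoint by hand. A minimal sketch (run from inside the Pod, using the ports defined above):

# Ask Alertmanager to re-read /etc/alertmanager/config.yml
curl -X POST http://localhost:9093/-/reload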

alert-conf.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: kube-system
data:
  config.yml: |-
    global:
      # How long to wait before declaring an alert resolved when it is no longer firing
      resolve_timeout: 5m
      # Email (SMTP) settings
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: ''
      smtp_auth_username: ''
      smtp_auth_password: '' # authorization password
      smtp_hello: '163.com'
      smtp_require_tls: false
    # Root route for all incoming alerts; this defines the dispatch policy
    route:
      # Labels used to regroup incoming alerts. For example, alerts carrying the labels
      # cluster=A and alertname=LatencyHigh will be aggregated into a single group.
      group_by: ['alertname', 'cluster']
      # When a new alert group is created, wait at least group_wait before sending the
      # initial notification, so that multiple alerts for the same group can be collected
      # and sent together.
      group_wait: 30s

      # After the first notification has been sent, wait group_interval before sending
      # notifications about new alerts added to the same group.
      group_interval: 5m

      # If a notification has already been sent successfully, wait repeat_interval
      # before re-sending it.
      repeat_interval: 5m

      # Default receiver: alerts that match no sub-route are sent here.
      receiver: default

      # All of the attributes above are inherited by the sub-routes and can be
      # overridden per sub-route.
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: '810553413@qq.com'
        send_resolved: true
    
    - name: 'email'
      email_configs:
      - to: '810553413@qq.com'
        send_resolved: true
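
Before creating the ConfigMap, the Alertmanager configuration can be validated locally with amtool (shipped in the prom/alertmanager image and in the Alertmanager release tarball). A quick sketch, assuming the config.yml content above has been saved to the current directory:

# Syntax-check the routing tree and receivers
amtool check-config config.yml
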

alert-pvc.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: alertmanager
spec:
  capacity:
    storage: 2Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  #storageClassName: managed-nfs-storage  # storageClassName must match the one defined under volumeClaimTemplates in prometheus-statefulset.yaml
  nfs:
    server: 192.168.2.7
    path: /data/k8s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  # Use your own dynamic PV (StorageClass) if available
  #storageClassName: managed-nfs-storage 
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"

alert-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: NodePort
kubectl create -f alert-pvc.yaml
kubectl create -f alert-conf.yaml 
kubectl create -f alert-deploy.yaml 
kubectl create -f alert-svc.yaml 
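
After applying the four manifests, a quick sanity check that the Pod is running, the PVC is bound and the Service got a NodePort:

kubectl get pods -n kube-system -l k8s-app=alertmanager
kubectl get pvc alertmanager -n kube-system
kubectl get svc alertmanager -n kube-system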

Startup errors

alert-deploy path issue


kubectl logs -f alertmanager-7d854bcbdf-7kh4k -n kube-system -c prometheus-alertmanager


Determine the cause of the failure from the error messages


level=info ts=2020-04-08T04:08:34.170520801Z caller=main.go:322 msg="Loading configuration file" file=/etc/config/alertmanager.yml
level=error ts=2020-04-08T04:08:34.170585226Z caller=main.go:325 msg="Loading configuration file failed" file=/etc/config/alertmanager.yml err="open /etc/config/alertmanager.yml: no such file or directory"
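
The log shows the binary looking for /etc/config/alertmanager.yml, while the ConfigMap above is mounted at /etc/alertmanager and its key is config.yml, so the --config.file flag and the mount have to agree. A sketch of the relevant lines once they are aligned (this is what the alert-deploy.yaml shown earlier already uses):

args:
  - --config.file=/etc/alertmanager/config.yml   # <mountPath>/<ConfigMap key>
volumeMounts:
  - name: alert-config
    mountPath: /etc/alertmanager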

alert-pvc creation issue

default-scheduler  pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
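
This usually means the PVC could not bind to a PV, for example because alert-pvc.yaml was applied after the Deployment, or because storageClassName / access modes on the PV and PVC do not match. A quick check of the binding state:

kubectl get pv alertmanager
kubectl get pvc alertmanager -n kube-system
# Both should show STATUS Bound; Pending means no matching PV was found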

alert-svc configuration issue

The Service "alertmanager" is invalid: spec.ports[1].name: Duplicate value: "http"
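
This error means two entries under spec.ports in the Service share the name http; port names must be unique within a Service. A sketch of the offending pattern and the fix (the second port is only illustrative, the original value is not shown in the error):

ports:
  - name: http          # first port, fine
    port: 80
    targetPort: 9093
  - name: cluster       # was also "http": rename so every port name is unique
    port: 9094
    targetPort: 9094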

Configuring Prometheus to communicate with Alertmanager

Edit the prometheus-configmap.yaml configuration file and add the Alertmanager binding:

alerting:
  alertmanagers:
    - static_configs:
      - targets: ["alertmanager:80"]
[root@k8s-master prometheus]# kubectl get svc --all-namespaces
NAMESPACE     NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-system   alertmanager                   NodePort    10.110.206.170   <none>        80:30199/TCP             72m
kube-system   prometheus                     NodePort    10.111.194.39    <none>        9090:31611/TCP           93d

Alertmanager console


Prometheus console

Check whether the configuration has taken effect

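The binding can also be verified against the Prometheus HTTP API, which lists the Alertmanager endpoints Prometheus will deliver alerts to. A sketch, assuming access through the NodePort 31611 shown in the Service listing above (replace <node-ip> with a node address):

# Returns the active (and dropped) Alertmanager endpoints
curl http://<node-ip>:31611/api/v1/alertmanagers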

Configuring alert rules

Edit prometheus.configmap.yaml and add the alerting rules reference:

    # Added: tell Prometheus where to load rule files from
    rule_files:
    - /etc/config/rules/*.rules


kubectl apply -f prometheus.configmap.yaml

Failure: Prometheus unreachable

Prometheus became unreachable. Troubleshoot patiently; one option is to first mount the rules directory into the container via the Deployment, and only then add the alerting rules configuration.
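
A typo in the key is enough to make the configuration fail to load (the correct Prometheus key is rule_files). If promtool is available locally (it ships alongside the prometheus binary), the file can be validated before applying the ConfigMap; a sketch, assuming the rendered configuration has been saved as prometheus.yml:

# Validates the main configuration and any rule files it references
promtool check config prometheus.yml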


Editing the alerting rules

prometheus-rules.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} 停止工作"
          description: "{{ $labels.instance }} job {{ $labels.job }} 已经停止5分钟以上."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于80% (当前值: {{ $value }})"

      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 20
        for: 1m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "Instance {{ $labels.instance }} 内存使用率过高"
          description: "{{ $labels.instance }}内存使用大于80% (当前值: {{ $value }})"

      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU使用率过高"
          description: "{{ $labels.instance }}CPU使用大于60% (当前值: {{ $value }})"
[root@k8s-master prometheus]# kubectl get cm --all-namespaces
NAMESPACE     NAME                                 DATA   AGE
kube-public   cluster-info                         1      152d
kube-system   alert-config                         1      9h
kube-system   coredns                              1      152d
kube-system   extension-apiserver-authentication   6      152d
kube-system   kube-flannel-cfg                     2      152d
kube-system   kube-proxy                           2      152d
kube-system   kubeadm-config                       2      152d
kube-system   kubelet-config-1.16                  1      152d
kube-system   prometheus-blackbox-exporter         1      29d
kube-system   prometheus-config                    1      15m
kube-system   prometheus-rules                     2      6h58m
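
The rule files themselves can be validated before they are loaded. A sketch with promtool, assuming the two groups above have been saved locally as general.rules and node.rules:

promtool check rules general.rules node.rules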

Mount the ConfigMap into the container's rules directory

Modify the mount configuration, reusing the dynamic PV from the previously deployed prometheus.deploy:

volumeMounts:
# Added: mount the rules ConfigMap into the container
- name: prometheus-rules
  mountPath: /etc/config/rules
  subPath: ""

volumes:
  # Added: rules volume
  - name: prometheus-rules
    # Added: backed by a ConfigMap
    configMap:
      # Added: the ConfigMap name defined in prometheus-rules.yaml
      name: prometheus-rules

After changing the config file, Prometheus needs to be restarted.

# Create the ConfigMap and update the PV
kubectl apply -f prometheus-rules.yaml
# If the prometheus.deploy update fails, delete it first and re-create it
kubectl delete -f prometheus.deploy.yaml
kubectl apply -f prometheus.deploy.yaml
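
Instead of deleting and re-creating the Deployment, a rolling restart is usually enough for Prometheus to pick up the new volume and mounts; a sketch, assuming the Deployment defined in prometheus.deploy.yaml is named prometheus and lives in kube-system:

kubectl rollout restart deployment/prometheus -n kube-system
kubectl rollout status deployment/prometheus -n kube-system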

Storage server (NFS)

[root@localhost ~]# cat /etc/exports
/data/k8s  192.168.2.0/24(rw,no_root_squash,sync)
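
If the export was added just now, it has to be (re-)exported on the NFS server, and it can be checked from a Kubernetes node (assuming nfs-utils is installed there):

exportfs -rv                  # on the NFS server: re-export everything in /etc/exports
showmount -e 192.168.2.7      # on a k8s node: confirm /data/k8s is exported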

View the alerting rules on the Alerts page


Accessing the alerts admin page


We can see the alerting rules we just defined on the page, and each alert also shows its state. During its lifecycle, an alert can be in one of three states:

  • inactive: the alert is neither firing nor pending
  • pending: the alert condition has been met, but for less than the time configured in the rule's for field
  • firing: the alert condition has been met for longer than the time configured in the rule's for field
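
The same information is available from the Prometheus API, which is useful when the UI is not reachable; a sketch, again assuming the NodePort 31611 from the Service listing (only pending and firing alerts are returned):

# Each returned alert carries a "state" field: pending or firing
curl http://<node-ip>:31611/api/v1/alerts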

Simulating an alert

Modify the alerting rules in prometheus-rules.yaml (for example, lower a threshold so a rule fires).

After the change, hot-reload with: kubectl apply -f prometheus-rules.yaml
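
Whether Alertmanager actually received the alert can be checked through its own API as well as through the email notifications; a sketch, assuming the NodePort 30199 shown earlier for the alertmanager Service:

# Lists the alerts currently held by Alertmanager (API v1 in v0.15.x)
curl http://<node-ip>:30199/api/v1/alerts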

Alert received


Email received


Unable to send email

I restarted DNS and then the Pod, after which sending recovered; the cause remains unknown.

Notes

The team: node label must be identical in both places, otherwise the email receiver will not receive the alert:

  routes:
  - receiver: email
    group_wait: 10s
    match:
      team: node
  - alert: NodeMemoryUsage
    expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 20
    for: 1m
    labels:
      severity: warning
      team: node
    annotations:
      summary: "Instance {{ $labels.instance }} 内存使用率过高"
      description: "{{ $labels.instance }}内存使用大于80% (当前值: {{ $value }})"