Deploying Prometheus + Alertmanager + Grafana + DingTalk alerting on a Kubernetes cluster

Prepare the Kubernetes cluster

  • Preface
    With the Kubernetes cluster ready, deploy Prometheus to collect container and node resource metrics from the cluster, build alerting rules on top of the collected metrics, and thereby improve monitoring response.
$ kubectl get  node
NAME       STATUS     ROLES    AGE   VERSION
master01   Ready      master   13d   v1.16.0
master02   Ready      master   13d   v1.16.0
master03   Ready      master   13d   v1.16.0
node01     Ready      <none>   13d   v1.16.0
node02     Ready      <none>   13d   v1.16.0
node03     Ready      <none>   13d   v1.16.0
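
All manifests below are deployed into a dedicated prometheus namespace, which none of them creates. Assuming it does not exist yet, create it first:
kubectl create namespace prometheus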

Deploy the Prometheus node-exporter

  • node-exporter runs as a DaemonSet and binds port 9100 on each host
# cat node-exporter-ds.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: node-exporter
    app.kubernetes.io/managed-by: Helm
    grafanak8sapp: "true"
  name: node-exporter
  namespace: prometheus
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-exporter
        grafanak8sapp: "true"
    spec:
      containers:
      - args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --web.listen-address=$(HOST_IP):9100
        env:
        - name: HOST_IP
          value: 0.0.0.0
        image: prom/node-exporter:v0.16.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 9100
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 9100
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys

Apply the manifest
kubectl apply -f node-exporter-ds.yaml
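
Because the DaemonSet uses hostNetwork and hostPort 9100, each node serves its metrics on its own IP. A quick check (a sketch; replace <node-ip> with the address of any node):
kubectl -n prometheus get pods -l app=node-exporter -o wide
curl -s http://<node-ip>:9100/metrics | head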

Deploy blackbox-exporter

1. Prepare the blackbox-exporter Deployment manifest
# cat blackbox-exporter-deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: blackbox-exporter
    spec:
      containers:
      - image: prom/blackbox-exporter
        imagePullPolicy: IfNotPresent
        name: blackbox-exporter
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

2. Prepare the blackbox-exporter Service manifest
# cat blackbox-exporter-svc.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: prometheus
spec:
  ports:
  - name: blackbox
    port: 9115
    protocol: TCP
    targetPort: 9115
  selector:
    app: blackbox-exporter
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Apply the manifests
kubectl apply  -f blackbox-exporter-deployment.yaml
kubectl apply  -f blackbox-exporter-svc.yaml
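
To confirm the exporter answers probes before Prometheus is wired up, the /probe endpoint can be queried from inside the cluster (a sketch; https://example.com is only a placeholder target):
kubectl -n prometheus run probe-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s "http://blackbox-exporter.prometheus.svc.cluster.local:9115/probe?module=http_2xx&target=https://example.com"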

Deploy kube-state-metrics to collect cluster object and node state metrics

# manifests: https://github.com/kubernetes/kube-state-metrics/tree/release-1.9/examples/standard
# (wget on the GitHub tree URL only returns HTML, so clone the branch instead)
git clone -b release-1.9 https://github.com/kubernetes/kube-state-metrics.git
Then make two changes in examples/standard:
1. Change the namespace from kube-system to prometheus
2. Add the label grafanak8sapp: "true" to the Deployment
├── cluster-role-binding.yaml
├── cluster-role.yaml
├── deployment.yaml
├── service-account.yaml
└── service.yaml

# The kube-state-metrics Deployment used here is shown below; it replaces the deployment.yaml from above
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 1
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.8
    grafanak8sapp: "true"
  name: kube-state-metrics
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.9.8
        grafanak8sapp: "true"
    spec:
      containers:
      - image: quay.io/coreos/kube-state-metrics:v1.9.8
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 8081
          name: telemetry
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-state-metrics
      serviceAccountName: kube-state-metrics
      terminationGracePeriodSeconds: 30


Apply the manifests
kubectl  apply -f  ./
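
Once the Pod is running, kube-state-metrics serves its metrics on port 8080; a port-forward is enough for a quick check (a sketch):
kubectl -n prometheus port-forward deploy/kube-state-metrics 8080:8080 &
curl -s http://127.0.0.1:8080/metrics | grep kube_node_status_condition | head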

Deploy Prometheus

  • The Prometheus configuration defines the following scrape jobs; kubernetes-nodes covers every node in the cluster
  • prometheus: Prometheus monitoring itself
  • kubernetes-services: probes Services through blackbox-exporter
  • kubernetes-nodes-cadvisor: collects Pod and container metrics from cAdvisor on each node
  • kubernetes-ingresses: probes Ingresses through blackbox-exporter
  • kubernetes-kubelet: monitors the kubelet on each node
  • traefik: monitors Traefik liveness
  • kubernetes-apiservers: monitors API server availability
  • 关键性服务监控: monitors the status of key external services
  • blackbox_http_pod_probe: probes individual Pods through blackbox-exporter
  • kubernetes-kube-state: collects cluster object and node state metrics from kube-state-metrics
├── prometheus-conf.yaml
├── prometheus-deployment.yaml
├── prometheus-ingress.yaml
├── prometheus-pv-pvc.yaml
├── prometheus-rules.yaml
└── prometheus-svc.yaml

1. Prepare the PV and PVC manifest
# cat prometheus-pv-pvc.yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-server
  namespace: prometheus
  labels:
    name: prometheus-server
spec:
  nfs:
    path: /export/nfs_share/volume-prometheus/prometheus-server
    server: 10.65.0.94
  accessModes: ["ReadWriteMany","ReadOnlyMany"]
  capacity:
    storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-server
  namespace: prometheus
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi
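
The PV assumes an NFS export on 10.65.0.94; the backing directory has to exist and be exported before the claim can bind (a sketch of the checks):
# on the NFS server
mkdir -p /export/nfs_share/volume-prometheus/prometheus-server
# from any node: the export should be listed
showmount -e 10.65.0.94
# after kubectl apply, the PVC should report Bound
kubectl -n prometheus get pv,pvc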

2. Prepare the Prometheus configuration ConfigMap

# cat prometheus-conf.yaml  
apiVersion: v1
data:
  prometheus.yml: |-
    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
          - targets:
            - 'alertmanager-service.prometheus:9093'
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "/etc/prometheus/rules/nodedown.rule.yml"
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: kubernetes-kube-state
        honor_timestamps: true
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        scheme: http
        relabel_configs:
        - separator: ;
          regex: __meta_kubernetes_pod_label_(.+)
          replacement: $1
          action: labelmap
        - source_labels: [__meta_kubernetes_namespace]
          separator: ;
          regex: (.*)
          target_label: kubernetes_namespace
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_pod_name]
          separator: ;
          regex: (.*)
          target_label: kubernetes_pod_name
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
          separator: ;
          regex: .*true.*
          replacement: $1
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_daemon, __meta_kubernetes_pod_node_name]
          separator: ;
          regex: node-exporter;(.*)
          target_label: nodename
          replacement: $1
          action: replace
        kubernetes_sd_configs:
        - role: pod

      - job_name: '关键性服务监控'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
            - https://news-graphql-xy.tgbus.com/healthz
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter.prometheus.svc.cluster.local:9115

      - job_name: blackbox_http_pod_probe
        honor_timestamps: true
        params:
          module:
          - http_2xx
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /probe
        scheme: http
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
          separator: ;
          regex: http
          replacement: $1
          action: keep
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
          separator: ;
          regex: ([^:]+)(?::\d+)?;(\d+);(.+)
          target_label: __param_target
          replacement: $1:$2$3
          action: replace
        - separator: ;
          regex: (.*)
          target_label: __address__
          replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
          action: replace
        - source_labels: [__param_target]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: $1
          action: replace
        - separator: ;
          regex: __meta_kubernetes_pod_label_(.+)
          replacement: $1
          action: labelmap
        - source_labels: [__meta_kubernetes_namespace]
          separator: ;
          regex: (.*)
          target_label: kubernetes_namespace
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_pod_name]
          separator: ;
          regex: (.*)
          target_label: kubernetes_pod_name
          replacement: $1
          action: replace
        kubernetes_sd_configs:
        - role: pod


      - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        job_name: kubernetes-nodes
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - regex: (.+)
          replacement: /api/v1/nodes/${1}/proxy/metrics
          source_labels:
          - __meta_kubernetes_node_name
          target_label: __metrics_path__
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true

      - job_name: 'kubernetes-services'
        metrics_path: /probe
        params:
          module: [http_2xx]
        kubernetes_sd_configs:
        - role: service
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          target_label: kubernetes_name

      - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        job_name: kubernetes-nodes-cadvisor
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - regex: (.+)
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
          source_labels:
          - __meta_kubernetes_node_name
          target_label: __metrics_path__
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true

      - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        job_name: 'kubernetes-ingresses'
        metrics_path: /probe
        params:
          module: [http_2xx]
        kubernetes_sd_configs:
        - role: ingress
        relabel_configs:
        - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
          regex: (.+);(.+);(.+)
          replacement: ${1}://${2}${3}
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_ingress_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_ingress_name]
          target_label: kubernetes_name

      - job_name: 'kubernetes-kubelet'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        job_name: kubernetes-apiservers
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - action: keep
          regex: default;kubernetes;https
          source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
kind: ConfigMap
metadata:
  labels:
    app: prometheus
  name: prometheus-conf
  namespace: prometheus


3. Prepare the Prometheus rules ConfigMap
# cat prometheus-rules.yaml 
apiVersion: v1
data:
  nodedown.rule.yml: |
    groups:
    - name: YingPuDev-Alerting
      rules:
      - alert: 实例崩溃
        expr: up {instance !~"lgy-k8s-node0031.k8s.host.com"} == 0
        for: 1m
        labels:
          severity: error
          team: 影谱运维
        annotations:
          summary: "实例{{ $labels.instance }}崩溃"
          description: "{{ $labels.instance }}的{{ $labels.job }}实例已经崩溃超过了1分钟。"

      - alert: pod cpu 使用率
        expr: sum by(pod, namespace) (rate(container_cpu_usage_seconds_total{container!="",container!="POD",image!=""}[1m])) / (sum by(pod, namespace) (container_spec_cpu_quota{container!="",container!="POD",image!=""} / 100000)) * 100 > 75
        for: 2m
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            命名空间: {{ $labels.namespace }}
            Pod名称: {{ $labels.pod }}
          summary: 容器CPU使用率大于75%,当前值是{{ printf "%.2f" $value }}


      - alert: Pod 内存使用率
        expr: sum by(container,namespace,pod) (container_memory_rss{image!=""}) / sum by(container,namespace,pod) (container_spec_memory_limit_bytes{image!=""}) * 100  !=  +Inf > 85
        for: 2m
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            命名空间: {{ $labels.namespace }}
            Pod名称: {{ $labels.pod }}
          summary: 容器内存使用率大于85%,当前值是{{ printf "%.2f" $value }}

      - alert: Pod OOM
        expr: sum by(pod, namespace) (changes(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[2m])) > 0
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            命名空间: {{ $labels.namespace }}
            Pod名称: {{ $labels.pod }}
          summary: Pod OOM

      - alert: Pod 重启
        expr: sum by(pod, namespace) (changes(kube_pod_container_status_restarts_total{namespace!="ad-socket"}[3m])) > 0
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            命名空间: {{ $labels.namespace }}
            Pod名称: {{ $labels.pod }}
          summary: Pod 重启

      - alert: Pod Pending
        expr: sum by(pod,namespace)  (kube_pod_status_phase{phase=~"Pending"}) ==1
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            命名空间: {{ $labels.namespace }}
            Pod名称: {{ $labels.pod }}
          summary: Pod Pending

      - alert: Pod ImagePullBackOff
        expr: sum by(pod,namespace) (kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"}) > 0
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            命名空间: {{ $labels.namespace }}
            Pod名称: {{ $labels.pod }}
          summary: Pod ImagePullBackOff

      - alert: 节点内存压力
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 1m
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            节点: {{ $labels.node }}
            实例: {{ $labels.instance }}
          summary: 节点的可用内存达到了驱逐阈值

      - alert: 节点磁盘压力
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 1m
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            节点: {{ $labels.node }}
            实例: {{ $labels.instance }}
          summary: 节点磁盘可用空间达到了阈值

      - alert: Node 节点状态异常
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1m
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            节点: {{ $labels.node }}
            实例: {{ $labels.instance }}
          summary: Node 节点状态异常

      - alert: Master节点CPU Requests
        expr: sum(kube_pod_container_resource_requests_cpu_cores{node=~"lgy-k8s-master0021|lgy-k8s-master0022|lgy-k8s-master0023"}) / sum(kube_node_status_allocatable_cpu_cores{node=~"lgy-k8s-master0021|lgy-k8s-master0022|lgy-k8s-master0023"}) * 100 > 75
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            来广营正式k8s集群 Master节点的CPU Requests大于75%,当前值为{{ printf "%.2f" $value }},请扩容Master节点!!
          summary: Master节点的CPU Requests大于75%,当前值为{{ printf "%.2f" $value }}

      - alert: Master节点 Memory Requests
        expr: sum(kube_pod_container_resource_requests_memory_bytes{node=~"lgy-k8s-master0021|lgy-k8s-master0022|lgy-k8s-master0023"}) / sum(kube_node_status_allocatable_memory_bytes{node=~"lgy-k8s-master0021|lgy-k8s-master0022|lgy-k8s-master0023"}) * 100 > 75
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            来广营正式k8s集群 Master节点的Memory Requests 大于75%,当前值为{{ printf "%.2f" $value }},请扩容Master节点!!
          summary: Master节点的Memory Requests大于75%,当前值为{{ printf "%.2f" $value }}

      - alert: Node节点CPU Requests
        expr: sum(kube_pod_container_resource_requests_cpu_cores{node=~"lgy-k8s-node0031.k8s.host.com|lgy-k8s-node0032|lgy-k8s-node0033|lgy-k8s-node0034|lgy-k8s-node0035|lgy-k8s-gpu-node0036|lgy-k8s-gpu-node0037"}) / sum(kube_node_status_allocatable_cpu_cores{node=~"lgy-k8s-node0031.k8s.host.com|lgy-k8s-node0032|lgy-k8s-node0033|lgy-k8s-node0034|lgy-k8s-node0035|lgy-k8s-gpu-node0036|lgy-k8s-gpu-node0037"}) * 100 > 75
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            来广营正式k8s集群 Node节点的CPU Requests大于75%,当前值为{{ printf "%.2f" $value }},请扩容Node节点!!
          summary: Node节点的CPU Requests大于75%,当前值为{{ printf "%.2f" $value }}

      - alert: Node节点 Memory Requests
        expr: sum(kube_pod_container_resource_requests_memory_bytes{node=~"lgy-k8s-node0031.k8s.host.com|lgy-k8s-node0032|lgy-k8s-node0033|lgy-k8s-node0034|lgy-k8s-node0035|lgy-k8s-gpu-node0036|lgy-k8s-gpu-node0037"}) / sum(kube_node_status_allocatable_memory_bytes{node=~"lgy-k8s-node0031.k8s.host.com|lgy-k8s-node0032|lgy-k8s-node0033|lgy-k8s-node0034|lgy-k8s-node0035|lgy-k8s-gpu-node0036|lgy-k8s-gpu-node0037"}) * 100 > 75
        for: 30s
        labels:
          severity: warning
          team: 影谱运维
        annotations:
          description: |
            来广营正式k8s集群 Node节点的Memory Requests 大于75%,当前值为{{ printf "%.2f" $value }},请扩容Node节点!!
          summary: Node节点的Memory Requests大于75%,当前值为{{ printf "%.2f" $value }}
kind: ConfigMap
metadata:
  labels:
    app: prometheus
  name: prometheus-rules
  namespace: prometheus


4. Prepare the Prometheus Deployment manifest
# Prepare the ServiceAccount manifest first
# cat prometheus-serviceaccount.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: prometheus

# Prometheus Deployment manifest (prometheus-deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: prometheus
    spec:
      containers:
      - args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=30d
        env:
        - name: STAKATER_PROMETHEUS_CONF_CONFIGMAP
          value: e4dd2838dd54e8392b62d85898083cc3d20210cc
        - name: STAKATER_PROMETHEUS_RULES_CONFIGMAP
          value: ca65a78fcb15d2c767166e468e8e734c6d4e267f
        image: prom/prometheus:latest
        imagePullPolicy: Always
        name: prometheus
        ports:
        - containerPort: 9090
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-data-volume
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-conf-volume
          subPath: prometheus.yml
        - mountPath: /etc/prometheus/rules
          name: prometheus-rules-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      serviceAccount: prometheus
      serviceAccountName: prometheus
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - name: prometheus-data-volume
        persistentVolumeClaim:
          claimName: prometheus-server
      - configMap:
          defaultMode: 420
          name: prometheus-conf
        name: prometheus-conf-volume
      - configMap:
          defaultMode: 420
          name: prometheus-rules
        name: prometheus-rules-volume

5. Prepare the Prometheus Service manifest
# cat prometheus-svc.yaml  
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus-service
  namespace: prometheus
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

6. Prepare the Prometheus Ingress manifest
# cat prometheus-ingress.yaml  
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: prometheus
spec:
  rules:
  - host: prometheus.movie.cn
    http:
      paths:
      - backend:
          serviceName: prometheus-service
          servicePort: 9090
status:
  loadBalancer: {}


Apply the manifests in order
kubectl apply  -f prometheus-serviceaccount.yaml 
kubectl apply  -f prometheus-conf.yaml
kubectl apply  -f prometheus-pv-pvc.yaml
kubectl apply  -f prometheus-rules.yaml
kubectl apply  -f prometheus-deployment.yaml
kubectl apply  -f prometheus-svc.yaml
kubectl apply  -f prometheus-ingress.yaml
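
Once the Prometheus Pod is running, the mounted configuration and rule files can be validated in place with promtool, which is included in the prom/prometheus image (a sketch):
POD=$(kubectl -n prometheus get pod -l app=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl -n prometheus exec "$POD" -- promtool check config /etc/prometheus/prometheus.yml
kubectl -n prometheus exec "$POD" -- promtool check rules /etc/prometheus/rules/nodedown.rule.yml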

# Permissive RBAC permissions: bind cluster-admin broadly (coarse but quick; a scoped alternative is sketched below)
kubectl create clusterrolebinding permissive-binding  --clusterrole=cluster-admin  --user=admin   --user=kubelet  --group=system:serviceaccounts
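
A narrower alternative (a sketch, not part of the original setup; the file name prometheus-rbac.yaml is just an example) is to bind a read-only ClusterRole to the prometheus ServiceAccount instead of handing out cluster-admin. The rules below cover the resources the scrape jobs above discover:
# cat prometheus-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: prometheus

kubectl apply -f prometheus-rbac.yaml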

Deploy Alertmanager

  • Prepare a mailbox that can send and receive mail, and obtain its SMTP authorization code
1. Prepare the Alertmanager main configuration
# cat alertmanager-config.yaml 
apiVersion: v1
data:
  config.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'gitlab@movie.cn'
      smtp_auth_username: 'gitlab@movie.cn'
      smtp_auth_password: 'password'
      smtp_require_tls: false

    route:
      group_by: ['alertname', 'cluster']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 20m
      receiver: default
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: 'long_ma@movie.cn'
      - to: 'test_li@movie.cn'
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: 'long_ma@movie.cn'
      - to: 'test_li@movie.cn'
        send_resolved: true
kind: ConfigMap
metadata:
  name: alert-config
  namespace: prometheus

2. Prepare the Alertmanager Deployment manifest
# cat alertmanager-deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: alertmanager
  name: alertmanager
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: alertmanager
    spec:
      containers:
      - args:
        - --config.file=/etc/alertmanager/config.yml
        - --storage.path=/alertmanager/data
        - --cluster.advertise-address=0.0.0.0:9093
        image: prom/alertmanager:v0.15.3
        imagePullPolicy: IfNotPresent
        name: alertmanager
        ports:
        - containerPort: 9093
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/alertmanager
          name: alertcfg
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: alert-config
        name: alertcfg

3. Prepare the Alertmanager Service manifest
# cat alertmanager-service.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    app: alertmanager
  name: alertmanager-service
  namespace: prometheus
spec:
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 31567
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: alertmanager
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

Apply the manifests
kubectl apply  -f alertmanager-config.yaml
kubectl apply  -f alertmanager-deployment.yaml
kubectl apply  -f alertmanager-service.yaml
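
The prom/alertmanager image ships with amtool, so the mounted configuration can be checked after the Pod starts (a sketch):
POD=$(kubectl -n prometheus get pod -l app=alertmanager -o jsonpath='{.items[0].metadata.name}')
kubectl -n prometheus exec "$POD" -- amtool check-config /etc/alertmanager/config.yml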

Deploy Grafana

  • Grafana needs to send email (for example when inviting users), so grafana.ini is extracted into a ConfigMap with the SMTP settings filled in.
├── grafana-conf.yaml
├── grafana-deployment.yaml
├── grafana-ingress.yaml
├── grafana-pv-pvc.yaml
└── grafana-service.yaml

1. Prepare the Grafana PV and PVC manifest
# cat grafana-pv-pvc.yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana
  namespace: prometheus
  labels:
    name: grafana
spec:
  nfs:
    path: /export/nfs_share/volume-prometheus/grafana
    server: 10.65.0.94
  accessModes: ["ReadWriteMany","ReadOnlyMany"]
  capacity:
    storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: prometheus
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi


2. Prepare the Grafana ConfigMap
# cat grafana-conf.yaml 
apiVersion: v1
data:
  grafana.ini: |
        [paths]
        [server]
        [database]
        [remote_cache]
        [dataproxy]
        [analytics]
        [security]
        [snapshots]
        [dashboards]
        [users]
        [auth]
        [auth.anonymous]
        [auth.github]
        [auth.gitlab]
        [auth.google]
        [auth.grafana_com]
        [auth.azuread]
        [auth.okta]
        [auth.generic_oauth]
        [auth.basic]
        [auth.proxy]
        [auth.ldap]
        [smtp]
        enabled = true
        host = smtp.exmail.qq.com:465
        user = gitlab@movie.cn
        password = password
        from_address = gitlab@movie.cn
        from_name = Grafana
        [emails]
        [log]
        [log.console]
        [log.file]
        [log.syslog]
        [quota]
        [alerting]
        [annotations.dashboard]
        [annotations.api]
        [explore]
        [metrics]
        [metrics.environment_info]
        [metrics.graphite]
        [grafana_com]
        [tracing.jaeger]
        [external_image_storage]
        [external_image_storage.s3]
        [external_image_storage.webdav]
        [external_image_storage.gcs]
        [external_image_storage.azure_blob]
        [external_image_storage.local]
        [rendering]
        [panels]
        [plugins]
        [plugin.grafana-image-renderer]
        [enterprise]
        [feature_toggles]
        [date_formats]            
kind: ConfigMap
metadata:
  labels:
    app: grafana
  name: grafana-conf
  namespace: prometheus

3. Prepare the Grafana Deployment manifest

# cat grafana-deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: grafana
    spec:
      containers:
      - env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        image: grafana/grafana:latest
        imagePullPolicy: IfNotPresent
        name: grafana
        ports:
        - containerPort: 3000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /login
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-data-volume
        - mountPath: /etc/grafana/grafana.ini
          name: configmap-volume
          subPath: grafana.ini
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      terminationGracePeriodSeconds: 30
      volumes:
      - name: grafana-data-volume
        persistentVolumeClaim:
          claimName: grafana
      - configMap:
          defaultMode: 420
          name: grafana-conf
        name: configmap-volume


4. Prepare the Grafana Service manifest
# cat grafana-service.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: grafana-service
  namespace: prometheus
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}


5. Prepare the Grafana Ingress manifest
# cat grafana-ingress.yaml 
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: prometheus
spec:
  rules:
  - host: grafana.movie.cn
    http:
      paths:
      - backend:
          serviceName: grafana-service
          servicePort: 3000
status:
  loadBalancer: {}

Apply the manifests
kubectl apply  -f grafana-pv-pvc.yaml 
kubectl apply  -f grafana-conf.yaml
kubectl apply  -f grafana-deployment.yaml
kubectl apply  -f grafana-service.yaml
kubectl apply  -f grafana-ingress.yaml 
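
After logging in (and changing the default admin password), the Prometheus data source can be added in the UI or through Grafana's HTTP API. A sketch, assuming the default admin:admin credentials, that grafana.movie.cn resolves to the ingress controller, and the Services defined above:
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://grafana.movie.cn/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus-service.prometheus.svc.cluster.local:9090","access":"proxy","isDefault":true}'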

Production Ingress configuration

  • For an Ingress to be probed by Prometheus it must opt in by adding the annotation prometheus.io/probe: "true". An example:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    prometheus.io/probe: "true"
  name: signinreception-node-test
  namespace: signin
spec:
  rules:
  - host: test.signin.xx.cn
    http:
      paths:
      - backend:
          serviceName: signin-node-test
          servicePort: 80
        path: /

Production Service configuration

  • For a Service to be probed by Prometheus it must opt in by adding the annotation prometheus.io/probe: "true". An example:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/probe: "true"
  labels:
    name: signinreception-node-test
  name: signinreception-node-test
  namespace: signin
spec:
  ports:
  - name: httpd
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    name: signinreception-node-test
  sessionAffinity: None
  type: ClusterIP

Production Deployment configuration

  • For a Deployment's Pods to be probed by Prometheus, the Pod template must carry the following annotations:
    blackbox_path: /
    blackbox_port: "80"
    blackbox_scheme: http
    An example:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: videoai-node-dev
  name: videoai-node-dev
  namespace: videoai
spec:
  replicas: 1
  selector:
    matchLabels:
      name: node-dev
  template:
    metadata:
      annotations:
        blackbox_path: /
        blackbox_port: "80"
        blackbox_scheme: http
      creationTimestamp: null
      labels:
        name: node-dev
    spec:
      containers:
      - image: node_dev:20201202103938
        imagePullPolicy: IfNotPresent
        name: node-dev

Prometheus in action

Configure Grafana

  • The requirement is to display Pod CPU, memory, and related metrics per business line in Grafana, with a dropdown that lists the Pods of a fixed namespace
  • Each Grafana expression pins the namespace, while the Pod is selected through a dashboard variable
  • Define the variable in the dashboard settings; a sketch of such a variable follows below
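
A minimal sketch of such a variable; the variable name container and the namespace 5g-meetting match the expressions below:
Variable name: container
Type:          Query
Data source:   Prometheus
Query:         label_values(container_memory_working_set_bytes{namespace="5g-meetting"}, pod)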


Metric expressions

CPU expression:
sum (rate (container_cpu_usage_seconds_total{container="POD", job="kubernetes-nodes-cadvisor", namespace="5g-meetting",pod="$container"}[1m])) by (container,pod)
Memory expression:
sum (container_memory_working_set_bytes{container="POD", job="kubernetes-nodes-cadvisor", namespace="5g-meetting",pod="$container"}) by (container, pod)

Extending Prometheus

Per-business alerting in Prometheus

  • Modify the Prometheus rules file: set the label user: agcm-platform (businesses are separated by namespace, so the expression filters on the business namespace) and name the alert agcm-platform is down:
      - alert: agcm-platform is down
        expr: probe_http_status_code {kubernetes_namespace=~"agcm-platform"} >= 400 or probe_http_status_code {kubernetes_namespace=~"agcm-platform"} == 0
        for: 1m
        labels:
          user: agcm-platform
        annotations:
          summary: "API {{ $labels.kubernetes_name }} 服务不可用"
          description: "{{ $labels.kubernetes_name }}的{{ $labels.job }}服务已经超过1分钟不可用了。当前状态码为:{{ $value }}"
  • Modify the Alertmanager configuration
Add the following under routes:
      - receiver: agcm-platform
        group_wait: 10s
        match:
          user: agcm-platform

Add the following under receivers:
    - name: 'agcm-platform'
      email_configs:
      - to: '1032957318@qq.com'
        send_resolved: true

The rule now shows up in Prometheus, and alert emails are sent separately for this business line.


Integrating DingTalk alerting with Prometheus

  • This section shows how to deliver alerts to DingTalk through Prometheus + Alertmanager.

Obtain the custom robot webhook

  • Open the DingTalk desktop client, click your avatar, and choose "Robot Management".

  • On the Robot Management page choose a "Custom" robot, enter a robot name, select the group that should receive the messages, and optionally set an avatar for the robot.

  • After completing the security settings, copy the robot's webhook URL; it is used to send messages to the group and has the following format:
    https://oapi.dingtalk.com/robot/send?access_token=XXXXXX
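
Before wiring it into Alertmanager, the webhook can be exercised directly with the DingTalk robot API (a sketch; the access_token is a placeholder, and the message text must satisfy whatever keyword or IP security rule was configured for the robot):
curl -H 'Content-Type: application/json' \
  -d '{"msgtype": "text", "text": {"content": "prometheus test alert"}}' \
  'https://oapi.dingtalk.com/robot/send?access_token=XXXXXX'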

Hook DingTalk into the Prometheus Alertmanager webhook

# Download the binary release
shell> wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
shell> tar zxvf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
shell> mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/dingtalk/
  • Configure the DingTalk alerting config file
shell> cd /usr/local/dingtalk/
# replace the url values in config.yml with the robot webhook URL copied above
shell> cp config.example.yml config.yml
# edit config.yml
shell> cat config.yml
## Request timeout
# timeout: 5s

## Customizable templates path
templates:
  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
# default_message:
#   title: '{{ template "legacy.title" . }}'
#   text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx
    mention:
      all: true

  • Configure the DingTalk alert template
shell> cat  /usr/local/dingtalk/contrib/templates/legacy/template.tmpl

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}

{{ define "default.__text_alert_list" }}{{ range . }}
---
**告警级别:** {{ .Labels.severity | upper }}

**运营团队:** {{ .Labels.team | upper }}

**触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**事件信息:** 
{{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}


{{ end }}

**事件标签:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ define "default.__text_alertresovle_list" }}{{ range . }}
---
**告警级别:** {{ .Labels.severity | upper }}

**运营团队:** {{ .Labels.team | upper }}

**触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**结束时间:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

**事件信息:**
{{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}


{{ end }}

**事件标签:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}

{{/* Default */}}
{{ define "default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}

{{ template "default.__text_alert_list" .Alerts.Firing }}


{{- end }}

{{ if gt (len .Alerts.Resolved) 0 -}}
{{ template "default.__text_alertresovle_list" .Alerts.Resolved }}


{{- end }}
{{- end }}

{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}

{{/* Following names for compatibility */}}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}


  • Start prometheus-webhook-dingtalk
shell> /usr/local/dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/dingtalk/config.yml  2>&1 &
  • Verify the service is listening
# netstat  -ntpl |grep 8060
tcp6       0      0 :::8060                 :::*                    LISTEN      8874/prometheus-web 
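
For anything longer-lived than a quick test, running the bridge under systemd is more robust than a backgrounded shell process. A sketch, not part of the original setup (WorkingDirectory matters because config.yml references contrib/templates/legacy/template.tmpl by relative path):
shell> cat /etc/systemd/system/prometheus-webhook-dingtalk.service
[Unit]
Description=prometheus-webhook-dingtalk
After=network.target

[Service]
WorkingDirectory=/usr/local/dingtalk
ExecStart=/usr/local/dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/dingtalk/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

shell> systemctl daemon-reload && systemctl enable --now prometheus-webhook-dingtalk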

Configure the Alertmanager webhook

  • Add a webhook_configs entry to the Alertmanager receivers
    receivers:
    - name: 'default'
      email_configs:
      - to: 'xi@mov.cn'
        send_resolved: true
      webhook_configs:
      - url: 'http://10.6.9.9:8060/dingtalk/webhook2/send'
        send_resolved: true

  • Check the alert messages delivered to the DingTalk group

Prometheus alert convergence (inhibition)

  • Inhibition: once an alert has fired, stop sending other alerts that are triggered by the same underlying issue
Configure Alertmanager:

    inhibit_rules:
      - source_match:            # when an alert matching these labels fires, inhibit the matching targets below
          severity: 'error'      # severity of the source alert
        target_match:            # alerts to be inhibited
          severity: 'warning'    # severity of the inhibited alerts
        equal: ['instance','kubernetes_name']  # the rule applies only when these labels are equal on source and target

For example, two alerts may produce identical alert emails and differ only in the rule label severity: error/warning;
with the configuration above, only the error-level alert is delivered, while the warning-level one is inhibited.