Monitoring - Prometheus 09 - Monitoring Kubernetes

  • Over the past few years, cloud computing and distributed computing have become some of the hottest technologies, and the growth of open-source software such as Docker, Kubernetes, and Prometheus has greatly accelerated the adoption of cloud computing.
  • Kubernetes uses Docker for container management. If the combination of Docker and Kubernetes is the cornerstone of the cloud-native era, then Prometheus gives cloud-native its wings. As the cloud-native community has grown and application scenarios have become more complex, a complete and open monitoring platform built for cloud-native environments has become necessary. Prometheus emerged in this context, with native support for Kubernetes.
  • Traditional deployment involves relatively complex steps. As the Operator pattern has matured, deploying Prometheus via an Operator has become the recommended approach: more of the operational work is folded into the Operator itself, which simplifies both the procedure and the deployment.

1. Introduction to Prometheus Operator

  • The Prometheus Operator for Kubernetes provides easy monitoring definitions for Kubernetes services and for deploying and managing Prometheus instances.
  • The Prometheus Operator (hereafter simply "the Operator") provides the following features (a minimal resource sketch follows this list):
    • Create/Destroy: easily launch a Prometheus instance in a Kubernetes namespace, for a specific application or team.
    • Simple configuration: configure the fundamentals of Prometheus, such as version, storage, retention policy, and replicas, through native Kubernetes resources.
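  • A minimal Prometheus custom resource might look like the sketch below. Every name and value here is an illustrative assumption, not taken from the Helm deployment shown later in this chapter.
]# vim prometheus-example.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example                    #illustrative name
  namespace: monitoring
spec:
  replicas: 2                      #identical replicas to avoid a single point of failure
  retention: 15d                   #how long to keep TSDB data
  serviceAccountName: prometheus   #assumed to exist with suitable RBAC
  serviceMonitorSelector: {}       #empty selector: select all ServiceMonitors
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 8Gi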

1.1. Prometheus Operator Architecture

  • The Prometheus Operator architecture is shown in Figure 11-1:

  • The components in this architecture run in the Kubernetes cluster as Kubernetes custom resources, each with its own role (a command to list the registered CRDs follows this list):
    • Operator: deploys and manages Prometheus Server according to custom resources (CustomResourceDefinitions, CRDs) and watches for changes to these custom resources, reacting accordingly; it is the control center of the whole system.
    • Prometheus resource: declares the desired state of the Prometheus StatefulSet controller; the Prometheus Operator ensures this StatefulSet always matches the definition at runtime.
    • Prometheus Server: the Operator deploys the Prometheus Server cluster according to the Prometheus resource; these custom resources can be regarded as the StatefulSet resources used to manage the Prometheus Server cluster.
    • Alertmanager resource: declares the desired state of the Alertmanager StatefulSet controller; the Prometheus Operator ensures this StatefulSet always matches the definition at runtime.
    • ServiceMonitor resource: declares the list of targets that Prometheus monitors. It selects the corresponding Service Endpoints by label, so that Prometheus Server scrapes metrics from the selected Services.
    • Service: simply put, the objects Prometheus monitors.
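  • Once the Operator is installed, the CRDs it registers can be listed as below; the exact set varies with the Operator version, but typically includes prometheuses, servicemonitors, podmonitors, probes, alertmanagers, and prometheusrules in the monitoring.coreos.com group.
]# kubectl get crd | grep monitoring.coreos.com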

1.2. Prometheus Operator Custom Resources

  • The Prometheus Operator has four kinds of custom resources:
    • Prometheus
    • ServiceMonitor
    • Alertmanager
    • PrometheusRule
  • List all resources in the monitoring namespace:
]# kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n monitoring

1. The Prometheus resource

  • The Prometheus custom resource (CRD) declares the desired state of the Prometheus StatefulSet controller, and the Prometheus Operator ensures this StatefulSet always matches the definition at runtime. It includes configuration options such as the number of replicas, persistent storage, and the Alertmanagers to which the Prometheus instances send alerts.
  • For each Prometheus resource, the Operator generates a StatefulSet controller in the same namespace. The Prometheus Pods mount a Secret named prometheus-<prometheus-name> that contains the Prometheus configuration. The Operator generates this configuration from the included ServiceMonitors and keeps the Secret up to date; any change to the ServiceMonitors or the Prometheus resource is continuously reconciled through these steps (see the Secret-inspection command after the resource dump below).
  • View the Prometheus resource (from the Helm deployment of kube-prometheus):
]# kubectl edit -n monitoring prometheus.monitoring.coreos.com/kube-prometheus-prometheus
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
  name: kube-prometheus-prometheus
  namespace: monitoring
spec:
  affinity:             #affinity settings
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: prometheus
              app.kubernetes.io/instance: kube-prometheus
              app.kubernetes.io/name: kube-prometheus
          namespaces:
          - monitoring
          topologyKey: kubernetes.io/hostname
        weight: 1
  alerting:            #the Alertmanagers associated with this Prometheus
    alertmanagers:
    - name: kube-prometheus-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: http
  containers:          #container definitions
  - name: prometheus   #the prometheus container
    livenessProbe:     #container liveness probe
      failureThreshold: 10
      httpGet:
        path: /-/healthy
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    readinessProbe:    #container readiness probe
      failureThreshold: 10
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
    startupProbe:      #container startup probe
      failureThreshold: 60
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
  - name: config-reloader    #the config-reloader container
    livenessProbe:
      failureThreshold: 6
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 6
      initialDelaySeconds: 15
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
  enableAdminAPI: false
  evaluationInterval: 30s
  externalUrl: http://127.0.0.1:9090/
  image: docker.io/bitnami/prometheus:2.39.1-debian-11-r1    #image
  listenLocal: false
  logFormat: logfmt    #log format
  logLevel: info
  paused: false
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: kube-prometheus
      app.kubernetes.io/name: kube-prometheus
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  portName: web
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 1         #number of Prometheus replicas (here only one). Prometheus itself has no clustering; running several identical replicas merely avoids a single point of failure
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector: {}    #which PrometheusRules this Prometheus uses, selected by label
  scrapeInterval: 30s
  securityContext:
    fsGroup: 1001
    runAsUser: 1001
  serviceAccountName: kube-prometheus-prometheus
  serviceMonitorNamespaceSelector: {}    #namespaces in which Prometheus looks for ServiceMonitors, selected by namespace label; if empty, all namespaces are selected
  serviceMonitorSelector: {}             #which ServiceMonitors Prometheus monitors, selected by ServiceMonitor label; if empty, all are selected
  shards: 1
  storage:             #storage volume definition
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: nfs-client
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-10-22T21:24:42Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2022-10-22T21:12:51Z"
    status: "True"
    type: Reconciled
  paused: false
  replicas: 1
  shardStatuses:
  - availableReplicas: 1
    replicas: 1
    shardID: "0"
    unavailableReplicas: 0
    updatedReplicas: 1
  unavailableReplicas: 0
  updatedReplicas: 1
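  • As noted above, the rendered Prometheus configuration is stored gzip-compressed (key prometheus.yaml.gz) in the Secret mounted by the Pod. One way to inspect it, assuming the Helm release names used in this chapter:
]# kubectl get secret prometheus-kube-prometheus-prometheus -n monitoring -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip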
  • View the StatefulSet controller generated from the Prometheus resource (from the Helm deployment of kube-prometheus):
]# kubectl edit -n monitoring statefulset.apps/prometheus-kube-prometheus-prometheus
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  generation: 1
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
    operator.prometheus.io/name: kube-prometheus-prometheus
    operator.prometheus.io/shard: "0"
  name: prometheus-kube-prometheus-prometheus
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Prometheus
    name: kube-prometheus-prometheus
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus-prometheus
      app.kubernetes.io/managed-by: prometheus-operator
      app.kubernetes.io/name: prometheus
      operator.prometheus.io/name: kube-prometheus-prometheus
      operator.prometheus.io/shard: "0"
      prometheus: kube-prometheus-prometheus
  serviceName: prometheus-operated
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: prometheus
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: prometheus
        app.kubernetes.io/instance: kube-prometheus-prometheus
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: prometheus
        app.kubernetes.io/version: 2.39.0
        operator.prometheus.io/name: kube-prometheus-prometheus
        operator.prometheus.io/shard: "0"
        prometheus: kube-prometheus-prometheus
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: prometheus
                  app.kubernetes.io/instance: kube-prometheus
                  app.kubernetes.io/name: kube-prometheus
              namespaces:
              - monitoring
              topologyKey: kubernetes.io/hostname
            weight: 1
      automountServiceAccountToken: true
      containers:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --storage.tsdb.retention.time=10d
        - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
        - --storage.tsdb.path=/prometheus
        - --web.enable-lifecycle
        - --web.external-url=http://127.0.0.1:9090/
        - --web.route-prefix=/
        - --web.config.file=/etc/prometheus/web_config/web-config.yaml
        image: docker.io/bitnami/prometheus:2.39.1-debian-11-r1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /-/healthy
            port: web
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 3
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/prometheus/certs
          name: tls-assets
          readOnly: true
        - mountPath: /prometheus
          name: prometheus-kube-prometheus-prometheus-db
          subPath: prometheus-db
        - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
        - mountPath: /etc/prometheus/web_config/web-config.yaml
          name: web-config
          readOnly: true
          subPath: web-config.yaml
      - args:
        - --listen-address=:8080
        - --reload-url=http://127.0.0.1:9090/-/reload
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        name: config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          initialDelaySeconds: 15
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --watch-interval=0
        - --listen-address=:8080
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0
        imagePullPolicy: IfNotPresent
        name: init-config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        runAsUser: 1001
      serviceAccount: kube-prometheus-prometheus
      serviceAccountName: kube-prometheus-prometheus
      terminationGracePeriodSeconds: 600
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: prometheus-kube-prometheus-prometheus
      - name: tls-assets
        projected:
          defaultMode: 420
          sources:
          - secret:
              name: prometheus-kube-prometheus-prometheus-tls-assets-0
      - emptyDir: {}
        name: config-out
      - configMap:
          defaultMode: 420
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
        name: prometheus-kube-prometheus-prometheus-rulefiles-0
      - name: web-config
        secret:
          defaultMode: 420
          secretName: prometheus-kube-prometheus-prometheus-web-config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus-kube-prometheus-prometheus-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: nfs-client
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: prometheus-kube-prometheus-prometheus-8cbb4d97f
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: prometheus-kube-prometheus-prometheus-8cbb4d97f
  updatedReplicas: 1

2. The ServiceMonitor resource

  • The ServiceMonitor custom resource (CRD) declares how to monitor a dynamic set of services. It selects the services to be monitored (the targets) by label.
  • For the Prometheus Operator to monitor an application in the Kubernetes cluster, an Endpoints object for it must exist:
    • An Endpoints object is essentially a list of IP addresses.
    • Endpoints objects are built by Services. A Service discovers Pods through its label selector and adds them to the Endpoints object.
  • A Service may expose one or more ports, and these ports are typically backed by multiple Endpoints pointing at Pods.
  • The Prometheus Operator introduces the ServiceMonitor object, which discovers Endpoints objects and has Prometheus monitor those Pods:
    • The endpoints section of ServiceMonitor.Spec configures the ports and other scrape parameters of the Endpoints from which metrics are collected; use it exactly as defined when specifying endpoints.
  • Note: endpoints (lowercase) is a field of the ServiceMonitor CRD, while Endpoints (capitalized) is a Kubernetes resource type.
  • ServiceMonitors and the targets they discover may come from any namespace. This matters for cross-namespace monitoring, e.g. from a monitoring namespace:
    • Use ServiceMonitorNamespaceSelector under Prometheus.Spec to restrict, per Prometheus server, the namespaces from which ServiceMonitors are selected.
    • Use namespaceSelector under ServiceMonitor.Spec to restrict the namespaces in which Endpoints objects may be discovered. To discover targets in all namespaces, the namespaceSelector must be empty.
  • View a ServiceMonitor resource (from the Helm deployment of kube-prometheus):
]# kubectl edit -n monitoring servicemonitor.monitoring.coreos.com/kube-prometheus-node-exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  generation: 1
  labels:
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: node-exporter
    helm.sh/chart: node-exporter-3.2.1
  name: kube-prometheus-node-exporter
  namespace: monitoring
spec:
  endpoints:
  - port: metrics
    interval: 15s            #interval at which the Endpoints are scraped
    relabelings:             #label rewriting
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
  jobLabel: jobLabel
  namespaceSelector:    #namespaces in which to discover the Endpoints to be monitored (selected by name here)
    matchNames:
    - monitoring
  selector:             #which Endpoints to monitor, selected by Endpoints label
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus
      app.kubernetes.io/name: node-exporter
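  • For this ServiceMonitor to discover targets, a Service must exist whose labels match the selector above and which exposes a port named metrics. A minimal sketch of such a Service follows; its shape is assumed for illustration, and the actual chart-generated Service carries more fields.
apiVersion: v1
kind: Service
metadata:
  name: kube-prometheus-node-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/instance: kube-prometheus    #matched by selector.matchLabels above
    app.kubernetes.io/name: node-exporter
spec:
  ports:
  - name: metrics        #matched by endpoints[].port above
    port: 9100
    targetPort: 9100
  selector:
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/name: node-exporter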

3. The PrometheusRule resource

  • The PrometheusRule CRD declares the Prometheus rules needed by one or more Prometheus instances.
  • Alerting and recording rules are saved and applied as YAML files, and can be loaded dynamically without restarting Prometheus.
  • Example PrometheusRule resources can be found at: https://github.com/prometheus-operator/kube-prometheus/tree/main/manifests
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: node-exporter-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
...

4. The Alertmanager resource

  • The Alertmanager resource declares the desired state of the Alertmanager StatefulSet controller, and the Prometheus Operator ensures this StatefulSet always matches the definition at runtime. It includes options such as the number of replicas and persistent storage.
  • For each Alertmanager resource, the Operator generates a StatefulSet controller in the same namespace. The Alertmanager Pods mount a Secret holding the generated configuration (named alertmanager-<alertmanager-name>-generated in the dump below).
  • With two or more replicas configured, the Operator runs the Alertmanager instances in high-availability mode (see the replica-patch example after the resource dump below).
  • View the Alertmanager resource (from the Helm deployment of kube-prometheus):
]# kubectl edit -n monitoring alertmanager.monitoring.coreos.com/kube-prometheus-alertmanager
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  generation: 1
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
  name: kube-prometheus-alertmanager
  namespace: monitoring
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: alertmanager
              app.kubernetes.io/instance: kube-prometheus
              app.kubernetes.io/name: kube-prometheus
          namespaces:
          - monitoring
          topologyKey: kubernetes.io/hostname
        weight: 1
  containers:
  - livenessProbe:
      failureThreshold: 120
      httpGet:
        path: /-/healthy
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    name: alertmanager
    readinessProbe:
      failureThreshold: 120
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
  - livenessProbe:
      failureThreshold: 6
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    name: config-reloader
    readinessProbe:
      failureThreshold: 6
      initialDelaySeconds: 15
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
  externalUrl: http://127.0.0.1:9093/
  image: docker.io/bitnami/alertmanager:0.24.0-debian-11-r46
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMetadata:
    labels:
      app.kubernetes.io/component: alertmanager
      app.kubernetes.io/instance: kube-prometheus
      app.kubernetes.io/name: kube-prometheus
  portName: web
  replicas: 1
  resources: {}
  retention: 120h
  routePrefix: /
  securityContext:
    fsGroup: 1001
    runAsUser: 1001
  serviceAccountName: kube-prometheus-alertmanager
  storage:    #storage volume definition
    volumeClaimTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: nfs-client
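  • To switch this single-replica Alertmanager to high-availability mode, it should suffice to raise spec.replicas on the custom resource; a sketch using kubectl patch (the replica count of 3 is illustrative):
]# kubectl patch -n monitoring alertmanager kube-prometheus-alertmanager --type merge -p '{"spec":{"replicas":3}}'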
  • View the StatefulSet controller generated from the Alertmanager resource (from the Helm deployment of kube-prometheus):
]# kubectl edit -n monitoring statefulset.apps/alertmanager-kube-prometheus-alertmanager
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-input-hash: "13509733468393518222"
  generation: 1
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
  name: alertmanager-kube-prometheus-alertmanager
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Alertmanager
    name: kube-prometheus-alertmanager
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      alertmanager: kube-prometheus-alertmanager
      app.kubernetes.io/instance: kube-prometheus-alertmanager
      app.kubernetes.io/managed-by: prometheus-operator
      app.kubernetes.io/name: alertmanager
  serviceName: alertmanager-operated
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: alertmanager
      creationTimestamp: null
      labels:
        alertmanager: kube-prometheus-alertmanager
        app.kubernetes.io/component: alertmanager
        app.kubernetes.io/instance: kube-prometheus-alertmanager
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: alertmanager
        app.kubernetes.io/version: 0.24.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: alertmanager
                  app.kubernetes.io/instance: kube-prometheus
                  app.kubernetes.io/name: kube-prometheus
              namespaces:
              - monitoring
              topologyKey: kubernetes.io/hostname
            weight: 1
      containers:
      - args:
        - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
        - --storage.path=/alertmanager
        - --data.retention=120h
        - --cluster.listen-address=
        - --web.listen-address=:9093
        - --web.external-url=http://127.0.0.1:9093/
        - --web.route-prefix=/
        - --cluster.peer=alertmanager-kube-prometheus-alertmanager-0.alertmanager-operated:9094
        - --cluster.reconnect-timeout=5m
        - --web.config.file=/etc/alertmanager/web_config/web-config.yaml
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: docker.io/bitnami/alertmanager:0.24.0-debian-11-r46
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 120
          httpGet:
            path: /-/healthy
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: alertmanager
        ports:
        - containerPort: 9093
          name: web
          protocol: TCP
        - containerPort: 9094
          name: mesh-tcp
          protocol: TCP
        - containerPort: 9094
          name: mesh-udp
          protocol: UDP
        readinessProbe:
          failureThreshold: 120
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          requests:
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/alertmanager/config
          name: config-volume
        - mountPath: /etc/alertmanager/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/alertmanager/certs
          name: tls-assets
          readOnly: true
        - mountPath: /alertmanager
          name: alertmanager-kube-prometheus-alertmanager-db
          subPath: alertmanager-db
        - mountPath: /etc/alertmanager/web_config/web-config.yaml
          name: web-config
          readOnly: true
          subPath: web-config.yaml
      - args:
        - --listen-address=:8080
        - --reload-url=http://127.0.0.1:9093/-/reload
        - --config-file=/etc/alertmanager/config/alertmanager.yaml.gz
        - --config-envsubst-file=/etc/alertmanager/config_out/alertmanager.env.yaml
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "-1"
        image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        name: config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          initialDelaySeconds: 15
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/alertmanager/config
          name: config-volume
          readOnly: true
        - mountPath: /etc/alertmanager/config_out
          name: config-out
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        runAsUser: 1001
      serviceAccount: kube-prometheus-alertmanager
      serviceAccountName: kube-prometheus-alertmanager
      terminationGracePeriodSeconds: 120
      volumes:
      - name: config-volume
        secret:
          defaultMode: 420
          secretName: alertmanager-kube-prometheus-alertmanager-generated
      - name: tls-assets
        projected:
          defaultMode: 420
          sources:
          - secret:
              name: alertmanager-kube-prometheus-alertmanager-tls-assets-0
      - emptyDir: {}
        name: config-out
      - name: web-config
        secret:
          defaultMode: 420
          secretName: alertmanager-kube-prometheus-alertmanager-web-config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: alertmanager-kube-prometheus-alertmanager-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: nfs-client
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: alertmanager-kube-prometheus-alertmanager-b74c5965d
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: alertmanager-kube-prometheus-alertmanager-b74c5965d
  updatedReplicas: 1

2. Deploying kube-prometheus with Helm

  • The Prometheus deployment environment is as follows:
    • Kubernetes version v1.20.14.
    • helm version v3.8.2.
    • kube-prometheus chart bitnami/kube-prometheus:8.1.11.

2.1. Create a Dynamic Storage Volume

  • Create the dynamic storage volume:
    • See section 6.2 ("Dynamic storage volumes") at https://www.cnblogs.com/maiblogs/p/16392831.html; you only need to follow it up to "Create the NFS StorageClass" (a quick verification command follows).
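  • Before continuing, it is worth confirming that the StorageClass exists:
]# kubectl get storageclass nfs-client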

2.2. Deploy kube-prometheus

  • kube-prometheus:8.1.11 automatically installs the following components:
    • prometheus-operator
    • prometheus
    • kube-state-metrics
    • node-exporter
    • blackbox-exporter
    • alertmanager

1. Create the namespace

]# kubectl create namespace monitoring

2. Download the kube-prometheus chart

]# helm repo add bitnami https://charts.bitnami.com/bitnami
]# helm search repo prometheus
]# helm pull bitnami/kube-prometheus

3. Edit values.yaml

//Unpack the chart
]# tar zxvf kube-prometheus-8.1.11.tgz

//Edit values.yaml
]# vim ./kube-prometheus/values.yaml
prometheus:
  ingress:
    enabled: true
    hostname: 
    annotations: {kubernetes.io/ingress.class: "nginx"}
    extraRules:
    - host: prometheus.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: kube-prometheus-prometheus
              port:
                number: 9090
  externalUrl: "http://127.0.0.1:9090/"
  persistence:
    enabled: true
    storageClass: "nfs-client"

alertmanager:
  ingress:
    enabled: true
    hostname: 
    annotations: {kubernetes.io/ingress.class: "nginx"}
    extraRules:
    - host: alertmanager.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: kube-prometheus-alertmanager
              port:
                number: 9093
  externalUrl: "http://127.0.0.1:9093/"
  persistence:
    enabled: true
    storageClass: "nfs-client"

4. Install kube-prometheus

]# helm install kube-prometheus kube-prometheus/ -n monitoring
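  • A quick health check after installation, assuming the release and namespace above:
]# helm status kube-prometheus -n monitoring
]# kubectl get pods -n monitoring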

5. Access Prometheus and Alertmanager

//Edit the hosts file (C:\Windows\System32\drivers\etc)
10.1.1.11 prometheus.local alertmanager.local
  • Access Prometheus at http://prometheus.local:32080/.

  • Access Alertmanager at http://alertmanager.local:32080/.

2.3. Set Up Alerting

2.3.1. Configure Alertmanager

1. Inspect the Alertmanager configuration

]# kubectl exec alertmanager-kube-prometheus-alertmanager-0 -n monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
]# kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d
global:
  resolve_timeout: 5m
receivers:
- name: "null"
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"

2. Modify the Alertmanager configuration

  • Create alertmanager.yaml:
    • Note that this alertmanager.yaml has two extra top levels, alertmanager and config, because it is passed to Helm as a values file rather than used as a raw Alertmanager config.
    • Note that if route has no child routes, you must set routes: [].
      • Otherwise you get: level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="undefined receiver \"null\" used in route"
    • Note that the Pod mounts the storage volume at /alertmanager/.
]# vim alertmanager.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: 'xxx@qq.com'
      smtp_auth_username: 'xxx@qq.com'
      smtp_auth_password: 'xxx'
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 10s
      receiver: 'email'
      routes: []
    receivers:
    - name: 'email'
      email_configs:
      - to: 'xxx@xxx.com.cn'
    templates:
    - '/alertmanager/template.tmpl'

3. Place the alert template template.tmpl on the storage volume

]# vim /data1/monitoring-alertmanager-kube-prometheus-alertmanager-db-alertmanager-kube-prometheus-alertmanager-0-pvc-85d5342a-9f8c-41c3-95fe-f11c77579b0c/alertmanager-db/template.tmpl 
{{ define "__subject" }}
{{ if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
{{ .Labels.alertname }}{{ .Annotations.title }}
{{ end }}{{ end }}{{ end }}

{{ define "email.default.html" }}
{{ range .Alerts }}
Alert name: {{ .Annotations.title }} <br>
Severity: {{ .Labels.severity }} <br>
Instance: {{ .Labels.instance }} <br>
Description: {{ .Annotations.description }} <br>
Team: {{ .Labels.team }} <br>
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end }}
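  • A note on the time format above: .StartsAt is in UTC, and .Add 28800e9 adds 28800e9 nanoseconds (8 hours) so the time renders in CST (UTC+8); adjust or drop the offset for other time zones.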

4. Rolling-update kube-prometheus

  • Apply the alertmanager.yaml values file:
]# helm upgrade kube-prometheus kube-prometheus/ --values=alertmanager.yaml -n monitoring
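  • To confirm the new configuration was rendered into the Secret, the same inspection command as before can be reused:
]# kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d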

2.3.2. Create Alert Rules

1. Create a PrometheusRule resource

]# vim node-exporter-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: node-exporter-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      expr: up == 0    #fires when a scrape target is down
      for: 10s
      labels:
        severity: "critical"
        team: "OPS"
      annotations:
        title: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes."

2. Apply the PrometheusRule resource

]# kubectl apply -f node-exporter-rules.yaml

//List PrometheusRule resources
]# kubectl get prometheusrule -A
NAMESPACE    NAME                  AGE
monitoring   node-exporter-rules   39s

2.4. Deploy Grafana

1. Download the Grafana chart

]# helm repo add bitnami https://charts.bitnami.com/bitnami
]# helm search repo grafana
]# helm pull bitnami/grafana

2. Edit values.yaml

//Unpack the chart
]# tar zxvf grafana-8.2.12.tgz

//Edit values.yaml
]# vim grafana/values.yaml 
admin:
  password: "admin"
persistence:
  storageClass: "nfs-client"
ingress:
  enabled: true
  hostname: 
  annotations: {kubernetes.io/ingress.class: "nginx"}
  extraRules:
  - host: grafana.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000

3. Install Grafana

]# helm install grafana grafana/ -n monitoring

4. Access Grafana

//Edit the hosts file (C:\Windows\System32\drivers\etc)
10.1.1.11 prometheus.local alertmanager.local grafana.local
  • Access Grafana at http://grafana.local:32080/.

5. Add a data source

  • Within a Kubernetes cluster, services can reach each other through in-cluster DNS names of the form Service_Name.Namespace_Name.svc.cluster.local (see the example URL below).
  • Services in the same namespace can reach each other directly by Service_Name.
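  • Assuming the release names used in this chapter, the Prometheus data source URL to enter in Grafana (which runs in the same monitoring namespace) would be:
http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090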

3. Deploying kube-prometheus from GitHub

  • Note: this is only a quick-start installation and does not use persistent volumes.
  • The Prometheus deployment environment is as follows:
    • Kubernetes version v1.20.14.
    • kube-prometheus version 0.8.0.
  • For compatibility between kube-prometheus and Kubernetes versions, see the compatibility matrix in the project README.

1. Download kube-prometheus

]# wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.8.0.tar.gz

2. Quick-deploy kube-prometheus

]# tar zxvf v0.8.0.tar.gz
]# cd kube-prometheus-0.8.0/

//Replace k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0 with bitnami/kube-state-metrics:2.0.0
]# vim ./manifests/kube-state-metrics-deployment.yaml
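//Equivalently, the same image substitution can be scripted with sed:
]# sed -i 's#k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0#bitnami/kube-state-metrics:2.0.0#' ./manifests/kube-state-metrics-deployment.yaml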

//First create the namespace, prometheus-operator, and the other setup resources
]# kubectl create -f manifests/setup
//Then deploy all remaining components
]# kubectl create -f manifests/
  • If you have deployed Prometheus before, clean up any possible leftovers first:
]# cd kube-prometheus-0.8.0/
]# kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
  • View the related resources:

//View the Pods
]# kubectl get pods -A
NAMESPACE         NAME                                        READY   STATUS      RESTARTS   AGE
monitoring        alertmanager-main-0                         2/2     Running     0          31s
monitoring        alertmanager-main-1                         2/2     Running     0          31s
monitoring        alertmanager-main-2                         2/2     Running     0          31s
monitoring        blackbox-exporter-55c457d5fb-4jvmm          3/3     Running     0          30s
monitoring        grafana-9df57cdc4-l7gxk                     1/1     Running     0          29s
monitoring        kube-state-metrics-6cb48468f8-dbdnc         3/3     Running     0          29s
monitoring        node-exporter-6svtr                         2/2     Running     0          29s
monitoring        node-exporter-hpfw9                         2/2     Running     0          29s
monitoring        node-exporter-jksr2                         2/2     Running     0          29s
monitoring        prometheus-adapter-59df95d9f5-rxzdg         1/1     Running     0          29s
monitoring        prometheus-adapter-59df95d9f5-zs46x         1/1     Running     0          29s
monitoring        prometheus-k8s-0                            2/2     Running     1          29s
monitoring        prometheus-k8s-1                            2/2     Running     1          29s
monitoring        prometheus-operator-7775c66ccf-bfxd9        2/2     Running     0          29m
...

//View the Services
]# kubectl get svc -A
NAMESPACE       NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                        AGE
monitoring      alertmanager-main                    ClusterIP   10.20.24.158    <none>        9093/TCP                       64s
monitoring      alertmanager-operated                ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP     64s
monitoring      blackbox-exporter                    ClusterIP   10.20.248.189   <none>        9115/TCP,19115/TCP             64s
monitoring      grafana                              ClusterIP   10.20.214.103   <none>        3000/TCP                       63s
monitoring      kube-state-metrics                   ClusterIP   None            <none>        8443/TCP,9443/TCP              63s
monitoring      node-exporter                        ClusterIP   None            <none>        9100/TCP                       63s
monitoring      prometheus-adapter                   ClusterIP   10.20.74.223    <none>        443/TCP                        63s
monitoring      prometheus-k8s                       ClusterIP   10.20.73.57     <none>        9090/TCP                       63s
monitoring      prometheus-operated                  ClusterIP   None            <none>        9090/TCP                       62s
monitoring      prometheus-operator                  ClusterIP   None            <none>        8443/TCP                       30m
...

//View the Deployment controllers
]# kubectl get deployment -A
NAMESPACE         NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
monitoring        blackbox-exporter          1/1     1            1           109s
monitoring        grafana                    1/1     1            1           108s
monitoring        kube-state-metrics         1/1     1            1           108s
monitoring        prometheus-adapter         2/2     2            2           108s
monitoring        prometheus-operator        1/1     1            1           30m
...

//View the StatefulSet controllers
]# kubectl get sts -A
NAMESPACE    NAME                READY   AGE
monitoring   alertmanager-main   3/3     2m1s
monitoring   prometheus-k8s      2/2     119s
...

3. Access the services

  • Access Prometheus:
    • http://10.1.1.11:19090/
//Listen on 10.1.1.11:19090 and forward requests to port 9090 of the Pods behind the Service
]# kubectl port-forward svc/prometheus-k8s --address=10.1.1.11 19090:9090 -n monitoring
  • Access Grafana:
    • http://10.1.1.11:13000/    (admin:admin)
//Listen on 10.1.1.11:13000 and forward requests to port 3000 of the Pods behind the Service
]# kubectl port-forward svc/grafana --address=10.1.1.11 13000:3000 -n monitoring
  • Access Alertmanager:
    • http://10.1.1.11:19093/
//Listen on 10.1.1.11:19093 and forward requests to port 9093 of the Pods behind the Service
]# kubectl port-forward svc/alertmanager-main --address=10.1.1.11 19093:9093 -n monitoring

4. Create Ingress rules

]# vim prometheus-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: prometheus.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
  - host: alertmanager.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
  - host: grafana.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
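  • Apply and verify the Ingress, assuming an nginx ingress controller is installed (reachable on port 32080, as in section 2):
]# kubectl apply -f prometheus-ingress.yaml
]# kubectl get ingress -n monitoring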
