Monitoring - Prometheus 09 - Monitoring Kubernetes
- Over the past few years, cloud computing has become one of the hottest areas of distributed computing, and the growth of open-source projects such as Docker, Kubernetes, and Prometheus has greatly accelerated it.
- Kubernetes uses Docker for container management. If Docker plus Kubernetes is the foundation of the cloud-native era, then Prometheus gives cloud native its wings. As the cloud-native community grew and application scenarios became more complex, a complete and open monitoring platform built for cloud-native environments was needed. Prometheus emerged in this context, with native support for Kubernetes.
- The traditional deployment procedure is relatively complex. As the Operator pattern has matured, deploying Prometheus via an Operator has become the recommended approach: more of the operational logic is built into the Operator itself, which simplifies both day-to-day operation and the deployment process.
1、Introduction to Prometheus Operator
- The Prometheus Operator for Kubernetes provides easy monitoring definitions for Kubernetes services and for the deployment and management of Prometheus instances.
- Prometheus Operator (hereafter simply "the Operator") provides the following features:
- Create/Destroy: easily launch a Prometheus instance in a Kubernetes namespace for a specific application or team.
- Simple configuration: configure the fundamentals of Prometheus, such as versions, storage, retention policies, and replicas, through native Kubernetes resources.
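The declarative configuration described above can be sketched as a minimal Prometheus custom resource. The names, the `team: frontend` label, and the storage class below are illustrative assumptions, not values from the deployment described later:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example            # hypothetical instance name
  namespace: monitoring
spec:
  replicas: 2              # identical Prometheus Pods, to avoid a single point of failure
  retention: 15d           # how long TSDB data is kept
  serviceAccountName: prometheus   # must be permitted to list/watch targets
  serviceMonitorSelector:
    matchLabels:
      team: frontend       # only ServiceMonitors carrying this label are used
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Applying this with `kubectl apply` causes the Operator to create and manage the corresponding StatefulSet; deleting the resource tears the instance down again.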
1.1、Prometheus Operator architecture
- The Prometheus Operator architecture is shown in Figure 11-1.
- The components of this architecture run in the Kubernetes cluster, declared as Kubernetes custom resources, and each plays a different role.
- Operator: deploys and manages Prometheus Server according to custom resources (Custom Resource Definitions, CRDs), and watches these custom resources for change events, reacting accordingly. It is the control center of the whole system.
- Prometheus resource: declares the desired state of a Prometheus StatefulSet; the Prometheus Operator ensures the running StatefulSet always matches this definition.
- Prometheus Server: the Operator deploys Prometheus Server clusters according to the Prometheus resources; these custom resources can be regarded as the handle used to manage the StatefulSets backing the Prometheus Server cluster.
- Alertmanager resource: declares the desired state of an Alertmanager StatefulSet; the Prometheus Operator ensures the running StatefulSet always matches this definition.
- ServiceMonitor resource: declares the list of targets Prometheus monitors. It selects the corresponding Service Endpoints by label, so that Prometheus Server scrapes metrics through the selected Services.
- Service: simply put, the objects Prometheus monitors.
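To make the label-based selection concrete, here is a sketch of a Service and a ServiceMonitor that selects it. All names, namespaces, and labels are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app
  namespace: default
  labels:
    app: example-app        # the label the ServiceMonitor selects on
spec:
  selector:
    app: example-app        # Pods backing the Endpoints object
  ports:
  - name: web               # named port referenced by the ServiceMonitor
    port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app      # pick Services carrying this label
  namespaceSelector:
    matchNames:
    - default               # look for the Service in this namespace
  endpoints:
  - port: web               # scrape the Service port named "web"
    interval: 30s
```

Prometheus then scrapes every Pod address in the Endpoints object built by that Service.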
1.2、Prometheus Operator custom resources
- The Prometheus Operator defines four custom resources:
- Prometheus
- ServiceMonitor
- Alertmanager
- PrometheusRule
- List every resource in the namespace:
]# kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n monitoring
1、The Prometheus resource
- The Prometheus custom resource (CRD) declares the desired state of a Prometheus StatefulSet, and the Prometheus Operator ensures the running StatefulSet always matches this definition. It covers options such as the replica count, persistent storage, and the Alertmanagers that the Prometheus instances send alerts to.
- The Prometheus Operator generates a StatefulSet from the Prometheus resource in the same namespace. The Prometheus Pods mount a Secret named <prometheus-name> that contains the Prometheus configuration. The Operator generates this configuration from the included ServiceMonitors and keeps the Secret updated; any change to the ServiceMonitors or to the Prometheus resource is continuously reconciled through the same steps.
- Inspect the Prometheus resource (from a Helm-based kube-prometheus deployment):
]# kubectl edit -n monitoring prometheus.monitoring.coreos.com/kube-promet
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
  name: kube-prometheus-prometheus
  namespace: monitoring
spec:
  affinity:                  # affinity rules
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: prometheus
              app.kubernetes.io/instance: kube-prometheus
              app.kubernetes.io/name: kube-prometheus
          namespaces:
          - monitoring
          topologyKey: kubernetes.io/hostname
        weight: 1
  alerting:                  # the Alertmanagers associated with this Prometheus
    alertmanagers:
    - name: kube-prometheus-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: http
  containers:                # container definitions
  - name: prometheus         # the prometheus container
    livenessProbe:           # liveness probe
      failureThreshold: 10
      httpGet:
        path: /-/healthy
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    readinessProbe:          # readiness probe
      failureThreshold: 10
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 3
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
    startupProbe:            # startup probe
      failureThreshold: 60
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
  - name: config-reloader    # the config-reloader container
    livenessProbe:
      failureThreshold: 6
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 6
      initialDelaySeconds: 15
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
  enableAdminAPI: false
  evaluationInterval: 30s
  externalUrl: http://127.0.0.1:9090/
  image: docker.io/bitnami/prometheus:2.39.1-debian-11-r1   # image
  listenLocal: false
  logFormat: logfmt          # log format
  logLevel: info
  paused: false
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: kube-prometheus
      app.kubernetes.io/name: kube-prometheus
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  portName: web
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 1                # number of replicas. Prometheus itself has no clustering; running several identical replicas only guards against a single point of failure
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector: {}           # which PrometheusRules this Prometheus uses, selected by label
  scrapeInterval: 30s
  securityContext:
    fsGroup: 1001
    runAsUser: 1001
  serviceAccountName: kube-prometheus-prometheus
  serviceMonitorNamespaceSelector: {}   # the namespaces in which to look for ServiceMonitors, selected by label; empty selects all
  serviceMonitorSelector: {}            # which ServiceMonitors to use, selected by label; empty selects all
  shards: 1
  storage:                   # storage definition
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: nfs-client
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-10-22T21:24:42Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2022-10-22T21:12:51Z"
    status: "True"
    type: Reconciled
  paused: false
  replicas: 1
  shardStatuses:
  - availableReplicas: 1
    replicas: 1
    shardID: "0"
    unavailableReplicas: 0
    updatedReplicas: 1
  unavailableReplicas: 0
  updatedReplicas: 1
- Inspect the StatefulSet generated from the Prometheus resource (from a Helm-based kube-prometheus deployment):

]# kubectl edit -n monitoring statefulset.apps/prometheus-kube-prometheus-prometheus
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  generation: 1
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
    operator.prometheus.io/name: kube-prometheus-prometheus
    operator.prometheus.io/shard: "0"
  name: prometheus-kube-prometheus-prometheus
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Prometheus
    name: kube-prometheus-prometheus
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus-prometheus
      app.kubernetes.io/managed-by: prometheus-operator
      app.kubernetes.io/name: prometheus
      operator.prometheus.io/name: kube-prometheus-prometheus
      operator.prometheus.io/shard: "0"
      prometheus: kube-prometheus-prometheus
  serviceName: prometheus-operated
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: prometheus
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: prometheus
        app.kubernetes.io/instance: kube-prometheus-prometheus
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: prometheus
        app.kubernetes.io/version: 2.39.0
        operator.prometheus.io/name: kube-prometheus-prometheus
        operator.prometheus.io/shard: "0"
        prometheus: kube-prometheus-prometheus
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: prometheus
                  app.kubernetes.io/instance: kube-prometheus
                  app.kubernetes.io/name: kube-prometheus
              namespaces:
              - monitoring
              topologyKey: kubernetes.io/hostname
            weight: 1
      automountServiceAccountToken: true
      containers:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --storage.tsdb.retention.time=10d
        - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
        - --storage.tsdb.path=/prometheus
        - --web.enable-lifecycle
        - --web.external-url=http://127.0.0.1:9090/
        - --web.route-prefix=/
        - --web.config.file=/etc/prometheus/web_config/web-config.yaml
        image: docker.io/bitnami/prometheus:2.39.1-debian-11-r1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /-/healthy
            port: web
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 3
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/prometheus/certs
          name: tls-assets
          readOnly: true
        - mountPath: /prometheus
          name: prometheus-kube-prometheus-prometheus-db
          subPath: prometheus-db
        - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
        - mountPath: /etc/prometheus/web_config/web-config.yaml
          name: web-config
          readOnly: true
          subPath: web-config.yaml
      - args:
        - --listen-address=:8080
        - --reload-url=http://127.0.0.1:9090/-/reload
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        name: config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          initialDelaySeconds: 15
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --watch-interval=0
        - --listen-address=:8080
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0
        imagePullPolicy: IfNotPresent
        name: init-config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        runAsUser: 1001
      serviceAccount: kube-prometheus-prometheus
      serviceAccountName: kube-prometheus-prometheus
      terminationGracePeriodSeconds: 600
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: prometheus-kube-prometheus-prometheus
      - name: tls-assets
        projected:
          defaultMode: 420
          sources:
          - secret:
              name: prometheus-kube-prometheus-prometheus-tls-assets-0
      - emptyDir: {}
        name: config-out
      - configMap:
          defaultMode: 420
          name: prometheus-kube-prometheus-prometheus-rulefiles-0
        name: prometheus-kube-prometheus-prometheus-rulefiles-0
      - name: web-config
        secret:
          defaultMode: 420
          secretName: prometheus-kube-prometheus-prometheus-web-config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus-kube-prometheus-prometheus-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: nfs-client
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: prometheus-kube-prometheus-prometheus-8cbb4d97f
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: prometheus-kube-prometheus-prometheus-8cbb4d97f
  updatedReplicas: 1
2、The ServiceMonitor resource
- The ServiceMonitor custom resource (CRD) declares how a dynamically changing set of services should be monitored. It selects the services (targets) to monitor by label.
- For the Prometheus Operator to monitor an application in the Kubernetes cluster, a matching Endpoints object must exist.
- An Endpoints object is essentially a list of IP addresses.
- Endpoints objects are built from Services: a Service discovers Pods through its selector and adds them to its Endpoints object.
- A Service may expose one or more ports; in the typical case these ports are backed by multiple endpoints that point to Pods.
- The Prometheus Operator introduces the ServiceMonitor object, which discovers those Endpoints objects and has Prometheus monitor the corresponding Pods.
- The endpoints section of ServiceMonitor.Spec configures the ports and other scrape parameters of the Endpoints from which metrics are collected; when naming endpoints there, use the names exactly as specified.
- Note: endpoints (lowercase) is a field in the ServiceMonitor CRD, while Endpoints (capitalized) is a Kubernetes resource type.
- ServiceMonitors and the targets they discover may live in any namespace. This is important for cross-namespace monitoring, for example from the monitoring namespace.
- Use ServiceMonitorNamespaceSelector under Prometheus.Spec to restrict, per Prometheus server, the namespaces from which ServiceMonitors are accepted.
- Use namespaceSelector under ServiceMonitor.Spec to restrict the namespaces in which Endpoints objects may be discovered. To discover targets across all namespaces, set any: true in the namespaceSelector (leaving the selector empty restricts discovery to the ServiceMonitor's own namespace).
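The three namespaceSelector shapes can be sketched side by side (a reading of the prometheus-operator API; verify against the version you run):

```yaml
# 1) Only explicitly listed namespaces:
namespaceSelector:
  matchNames:
  - default
  - kube-system

# 2) All namespaces in the cluster:
namespaceSelector:
  any: true

# 3) Empty selector: only the ServiceMonitor's own namespace:
namespaceSelector: {}
```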
- Inspect a ServiceMonitor resource (from a Helm-based kube-prometheus deployment):
]# kubectl edit -n monitoring servicemonitor.monitoring.coreos.com/kube-prometheus-node-exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  generation: 1
  labels:
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: node-exporter
    helm.sh/chart: node-exporter-3.2.1
  name: kube-prometheus-node-exporter
  namespace: monitoring
spec:
  endpoints:
  - port: metrics
    interval: 15s            # interval at which the Endpoints are scraped
    relabelings:             # relabeling rules
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
  jobLabel: jobLabel
  namespaceSelector:         # the namespaces in which to look for Endpoints to monitor
    matchNames:
    - monitoring
  selector:                  # which Endpoints to monitor, selected by label
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus
      app.kubernetes.io/name: node-exporter
3、PrometheusRule
- The PrometheusRule CRD declares the Prometheus rules needed by one or more Prometheus instances.
- Alerts and recording rules can be saved and applied as YAML files, and are loaded dynamically without a restart.
- PrometheusRule resources can be found at: https://github.com/prometheus-operator/kube-prometheus/tree/main/manifests
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: node-exporter-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling up.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
...
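Recording rules live in the same CRD as alerts. A minimal sketch (the resource name and rule group are assumptions; the prometheus and role labels must match the Prometheus resource's ruleSelector for the rule to be picked up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-recording-rules   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: k8s               # must match the Prometheus ruleSelector
    role: alert-rules
spec:
  groups:
  - name: node.rules
    rules:
    # Precompute per-instance CPU utilisation so dashboards query a cheap series
    - record: instance:node_cpu_utilisation:rate5m
      expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```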
4、Alertmanager
- The Alertmanager resource declares the desired state of an Alertmanager StatefulSet, and the Prometheus Operator ensures the running StatefulSet always matches this definition. It covers options such as replica count and persistent storage.
- The Prometheus Operator generates a StatefulSet from the Alertmanager resource in the same namespace. The Alertmanager Pods mount a Secret named <prometheus-name>.
- When two or more replicas are configured, the Operator runs the Alertmanager instances in high-availability mode.
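The high-availability case can be sketched as follows; the name and namespace are assumptions, and the Operator wires the replicas into a gossip cluster automatically:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example          # hypothetical instance name
  namespace: monitoring
spec:
  replicas: 3            # >= 2 enables high-availability (gossip) mode
```

Prometheus should then be pointed at all replicas (the Operator handles this through the alertmanager-operated Service), so deduplication happens in Alertmanager rather than by losing alerts.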
- Inspect an Alertmanager resource (from a Helm-based kube-prometheus deployment):

]# kubectl edit -n monitoring alertmanager.monitoring.coreos.com/kube-prometheus-alertmanager
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
  generation: 1
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
  name: kube-prometheus-alertmanager
  namespace: monitoring
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: alertmanager
              app.kubernetes.io/instance: kube-prometheus
              app.kubernetes.io/name: kube-prometheus
          namespaces:
          - monitoring
          topologyKey: kubernetes.io/hostname
        weight: 1
  containers:
  - livenessProbe:
      failureThreshold: 120
      httpGet:
        path: /-/healthy
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    name: alertmanager
    readinessProbe:
      failureThreshold: 120
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      initialDelaySeconds: 0
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
  - livenessProbe:
      failureThreshold: 6
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    name: config-reloader
    readinessProbe:
      failureThreshold: 6
      initialDelaySeconds: 15
      periodSeconds: 20
      successThreshold: 1
      tcpSocket:
        port: reloader-web
      timeoutSeconds: 5
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
  externalUrl: http://127.0.0.1:9093/
  image: docker.io/bitnami/alertmanager:0.24.0-debian-11-r46
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMetadata:
    labels:
      app.kubernetes.io/component: alertmanager
      app.kubernetes.io/instance: kube-prometheus
      app.kubernetes.io/name: kube-prometheus
  portName: web
  replicas: 1
  resources: {}
  retention: 120h
  routePrefix: /
  securityContext:
    fsGroup: 1001
    runAsUser: 1001
  serviceAccountName: kube-prometheus-alertmanager
  storage:                   # storage definition
    volumeClaimTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: nfs-client
- Inspect the StatefulSet generated from the Alertmanager resource (from a Helm-based kube-prometheus deployment):

]# kubectl edit -n monitoring statefulset.apps/alertmanager-kube-prometheus-alertmanager
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-input-hash: "13509733468393518222"
  generation: 1
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-prometheus
    helm.sh/chart: kube-prometheus-8.1.11
  name: alertmanager-kube-prometheus-alertmanager
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Alertmanager
    name: kube-prometheus-alertmanager
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      alertmanager: kube-prometheus-alertmanager
      app.kubernetes.io/instance: kube-prometheus-alertmanager
      app.kubernetes.io/managed-by: prometheus-operator
      app.kubernetes.io/name: alertmanager
  serviceName: alertmanager-operated
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: alertmanager
      creationTimestamp: null
      labels:
        alertmanager: kube-prometheus-alertmanager
        app.kubernetes.io/component: alertmanager
        app.kubernetes.io/instance: kube-prometheus-alertmanager
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: alertmanager
        app.kubernetes.io/version: 0.24.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: alertmanager
                  app.kubernetes.io/instance: kube-prometheus
                  app.kubernetes.io/name: kube-prometheus
              namespaces:
              - monitoring
              topologyKey: kubernetes.io/hostname
            weight: 1
      containers:
      - args:
        - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml
        - --storage.path=/alertmanager
        - --data.retention=120h
        - --cluster.listen-address=
        - --web.listen-address=:9093
        - --web.external-url=http://127.0.0.1:9093/
        - --web.route-prefix=/
        - --cluster.peer=alertmanager-kube-prometheus-alertmanager-0.alertmanager-operated:9094
        - --cluster.reconnect-timeout=5m
        - --web.config.file=/etc/alertmanager/web_config/web-config.yaml
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: docker.io/bitnami/alertmanager:0.24.0-debian-11-r46
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 120
          httpGet:
            path: /-/healthy
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: alertmanager
        ports:
        - containerPort: 9093
          name: web
          protocol: TCP
        - containerPort: 9094
          name: mesh-tcp
          protocol: TCP
        - containerPort: 9094
          name: mesh-udp
          protocol: UDP
        readinessProbe:
          failureThreshold: 120
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          requests:
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/alertmanager/config
          name: config-volume
        - mountPath: /etc/alertmanager/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/alertmanager/certs
          name: tls-assets
          readOnly: true
        - mountPath: /alertmanager
          name: alertmanager-kube-prometheus-alertmanager-db
          subPath: alertmanager-db
        - mountPath: /etc/alertmanager/web_config/web-config.yaml
          name: web-config
          readOnly: true
          subPath: web-config.yaml
      - args:
        - --listen-address=:8080
        - --reload-url=http://127.0.0.1:9093/-/reload
        - --config-file=/etc/alertmanager/config/alertmanager.yaml.gz
        - --config-envsubst-file=/etc/alertmanager/config_out/alertmanager.env.yaml
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "-1"
        image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        name: config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          initialDelaySeconds: 15
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: reloader-web
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: false
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/alertmanager/config
          name: config-volume
          readOnly: true
        - mountPath: /etc/alertmanager/config_out
          name: config-out
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        runAsUser: 1001
      serviceAccount: kube-prometheus-alertmanager
      serviceAccountName: kube-prometheus-alertmanager
      terminationGracePeriodSeconds: 120
      volumes:
      - name: config-volume
        secret:
          defaultMode: 420
          secretName: alertmanager-kube-prometheus-alertmanager-generated
      - name: tls-assets
        projected:
          defaultMode: 420
          sources:
          - secret:
              name: alertmanager-kube-prometheus-alertmanager-tls-assets-0
      - emptyDir: {}
        name: config-out
      - name: web-config
        secret:
          defaultMode: 420
          secretName: alertmanager-kube-prometheus-alertmanager-web-config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: alertmanager-kube-prometheus-alertmanager-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: nfs-client
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: alertmanager-kube-prometheus-alertmanager-b74c5965d
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: alertmanager-kube-prometheus-alertmanager-b74c5965d
  updatedReplicas: 1
2、Deploying kube-prometheus with Helm
- The Prometheus deployment environment:
- Kubernetes version v1.20.14.
- Helm version v3.8.2.
- kube-prometheus chart version bitnami/kube-prometheus 8.1.11.
2.1、Creating a dynamic storage volume
- Create a dynamic storage volume.
- See section "6.2、动态存储卷" at https://www.cnblogs.com/maiblogs/p/16392831.html; only the steps up to "创建NFS StorageClass" (creating the NFS StorageClass) are needed.
2.2、Deploying kube-prometheus
- kube-prometheus 8.1.11 automatically installs the following components:
- prometheus-operator
- prometheus
- kube-state-metrics
- node-exporter
- blackbox-exporter
- alertmanager
1、Create the namespace
]# kubectl create namespace monitoring
2、Download the kube-prometheus chart
]# helm repo add bitnami https://charts.bitnami.com/bitnami
]# helm search repo prometheus
]# helm pull bitnami/kube-prometheus
3、Modify values.yaml
# unpack the chart
]# tar zxvf kube-prometheus-8.1.11.tgz
# edit values.yaml
]# vim ./kube-prometheus/values.yaml
prometheus:
  ingress:
    enabled: true
    hostname:
    annotations: {kubernetes.io/ingress.class: "nginx"}
    extraRules:
    - host: prometheus.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: kube-prometheus-prometheus
              port:
                number: 9090
  externalUrl: "http://127.0.0.1:9090/"
  persistence:
    enabled: true
    storageClass: "nfs-client"
alertmanager:
  ingress:
    enabled: true
    hostname:
    annotations: {kubernetes.io/ingress.class: "nginx"}
    extraRules:
    - host: alertmanager.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: kube-prometheus-alertmanager
              port:
                number: 9093
  externalUrl: "http://127.0.0.1:9093/"
  persistence:
    enabled: true
    storageClass: "nfs-client"
- This is the modified values.yaml file.
4、Install kube-prometheus
]# helm install kube-prometheus kube-prometheus/ -n monitoring
5、Access prometheus and alertmanager
# edit the hosts file (C:\Windows\System32\drivers\etc)
10.1.1.11 prometheus.local alertmanager.local
- Access prometheus at http://prometheus.local:32080/.
- Access alertmanager at http://alertmanager.local:32080/.
2.3、Implementing alerting
2.3.1、Configuring alertmanager
1、View the current alertmanager configuration
]# kubectl exec alertmanager-kube-prometheus-alertmanager-0 -n monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
]# kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d
global:
  resolve_timeout: 5m
receivers:
- name: "null"
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"
2、Modify the alertmanager configuration
- Create alertmanager.yaml.
- Note: this alertmanager.yaml is wrapped in two extra top-level keys, alertmanager and config.
- Note: if route has no child routes, routes: [] must be set explicitly.
- Otherwise loading fails with: level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="undefined receiver \"null\" used in route"
- Note: the Pod mounts the storage volume at /alertmanager/.
]# vim alertmanager.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: 'xxx@qq.com'
      smtp_auth_username: 'xxx@qq.com'
      smtp_auth_password: 'xxx'
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 10s
      receiver: 'email'
      routes: []
    receivers:
    - name: 'email'
      email_configs:
      - to: 'xxx@xxx.com.cn'
    templates:
    - '/alertmanager/template.tmpl'
3、Place the alert template template.tmpl on the storage volume
]# vim /data1/monitoring-alertmanager-kube-prometheus-alertmanager-db-alertmanager-kube-prometheus-alertmanager-0-pvc-85d5342a-9f8c-41c3-95fe-f11c77579b0c/alertmanager-db/template.tmpl
{{ define "__subject" }}
{{ if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
{{ .Labels.alertname }}{{ .Annotations.title }}
{{ end }}{{ end }}{{ end }}

{{ define "email.default.html" }}
{{ range .Alerts }}
告警名称: {{ .Annotations.title }} <br>
告警级别: {{ .Labels.severity }} <br>
告警主机: {{ .Labels.instance }} <br>
告警信息: {{ .Annotations.description }} <br>
维护团队: {{ .Labels.team }} <br>
告警时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end }}
4、Rolling-update kube-prometheus
- Apply the updated alertmanager.yaml values file
]# helm upgrade kube-prometheus kube-prometheus/ --values=alertmanager.yaml -n monitoring
2.3.2、Creating alert rules
1、Create a PrometheusRule resource
]# vim node-exporter-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: node-exporter-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      expr: up == 0   # fires when a target's up metric is 0, i.e. the instance is down
      for: 10s
      labels:
        severity: "告警级别critical"
        team: "维护团队OPS"
      annotations:
        title: "告警名称Instance {{ $labels.instance }} down"
        description: "告警信息{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes."
2、Apply the PrometheusRule resource
]# kubectl apply -f node-exporter-rules.yaml
# check the prometheusrule resource
]# kubectl get prometheusrule -A
NAMESPACE    NAME                  AGE
monitoring   node-exporter-rules   39s
2.4、Deploying grafana
1、Download the grafana chart
]# helm repo add bitnami https://charts.bitnami.com/bitnami
]# helm search repo grafana
]# helm pull bitnami/grafana
2、Modify values.yaml
# unpack the chart
]# tar zxvf grafana-8.2.12.tgz
# edit values.yaml
]# vim grafana/values.yaml
admin:
  password: "admin"
persistence:
  storageClass: "nfs-client"
ingress:
  enabled: true
  hostname:
  annotations: {kubernetes.io/ingress.class: "nginx"}
  extraRules:
  - host: grafana.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
3. Install Grafana

```shell
]# helm install grafana grafana/ -n monitoring
```
4. Access Grafana

```shell
// Edit the hosts file (C:\Windows\System32\drivers\etc)
10.1.1.11 prometheus.local alertmanager.local grafana.local
```
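The single hosts line above maps all three virtual hosts to the same node; one entry per line works equally well. A side-effect-free sketch of generating the per-line form (written to a temp file rather than the real hosts file):

```shell
hosts=$(mktemp)
# printf reuses its format string for each remaining argument,
# emitting one hosts entry per hostname.
printf '10.1.1.11 %s\n' prometheus.local alertmanager.local grafana.local >> "$hosts"
cat "$hosts"
# -> 10.1.1.11 prometheus.local
#    10.1.1.11 alertmanager.local
#    10.1.1.11 grafana.local
```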
- Access Grafana at http://grafana.local:32080/.
5. Add a data source
- Within Kubernetes, services can reach each other through in-cluster DNS names of the form Service_Name.Namespace_Name.svc.cluster.local.
- Services in the same namespace can reach each other directly by Service_Name.
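The naming rule above can be sketched as a tiny helper. The service and namespace names below are illustrative (they match the kube-prometheus manifests used in the next section); with them, an in-cluster Grafana data source URL for Prometheus would be built like this:

```shell
# Compose the in-cluster DNS name for a Service.
svc_dns() {  # usage: svc_dns <service> <namespace>
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

svc_dns prometheus-k8s monitoring
# -> prometheus-k8s.monitoring.svc.cluster.local

# A data source URL would then be the DNS name plus the service port:
echo "http://$(svc_dns prometheus-k8s monitoring):9090"
```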
3. Deploying kube-prometheus from GitHub
- Note: this is only a quick-start installation; it does not use persistent volumes.
- The deployment environment is as follows:
- Kubernetes version v1.20.14.
- kube-prometheus version 0.8.0.
- Check the compatibility matrix between kube-prometheus and Kubernetes before picking a release.
1. Download kube-prometheus

```shell
]# wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.8.0.tar.gz
```
2. Quick deployment of kube-prometheus

```shell
]# tar zxvf v0.8.0.tar.gz
]# cd kube-prometheus-0.8.0/

// Replace k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0 with bitnami/kube-state-metrics:2.0.0
]# vim ./manifests/kube-state-metrics-deployment.yaml

// First create the namespace, prometheus-operator, and so on
]# kubectl create -f manifests/setup

// Deploy all the remaining components
]# kubectl create -f manifests/
```
- If Prometheus has been deployed before, clean up any possible leftovers first:

```shell
]# cd kube-prometheus-0.8.0/
]# kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
```
- View the related resources:
```shell
// List pods
]# kubectl get pods -A
NAMESPACE    NAME                                   READY   STATUS    RESTARTS   AGE
monitoring   alertmanager-main-0                    2/2     Running   0          31s
monitoring   alertmanager-main-1                    2/2     Running   0          31s
monitoring   alertmanager-main-2                    2/2     Running   0          31s
monitoring   blackbox-exporter-55c457d5fb-4jvmm     3/3     Running   0          30s
monitoring   grafana-9df57cdc4-l7gxk                1/1     Running   0          29s
monitoring   kube-state-metrics-6cb48468f8-dbdnc    3/3     Running   0          29s
monitoring   node-exporter-6svtr                    2/2     Running   0          29s
monitoring   node-exporter-hpfw9                    2/2     Running   0          29s
monitoring   node-exporter-jksr2                    2/2     Running   0          29s
monitoring   prometheus-adapter-59df95d9f5-rxzdg    1/1     Running   0          29s
monitoring   prometheus-adapter-59df95d9f5-zs46x    1/1     Running   0          29s
monitoring   prometheus-k8s-0                       2/2     Running   1          29s
monitoring   prometheus-k8s-1                       2/2     Running   1          29s
monitoring   prometheus-operator-7775c66ccf-bfxd9   2/2     Running   0          29m
...

// List services
]# kubectl get svc -A
NAMESPACE    NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
monitoring   alertmanager-main       ClusterIP   10.20.24.158    <none>        9093/TCP                     64s
monitoring   alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   64s
monitoring   blackbox-exporter       ClusterIP   10.20.248.189   <none>        9115/TCP,19115/TCP           64s
monitoring   grafana                 ClusterIP   10.20.214.103   <none>        3000/TCP                     63s
monitoring   kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            63s
monitoring   node-exporter           ClusterIP   None            <none>        9100/TCP                     63s
monitoring   prometheus-adapter      ClusterIP   10.20.74.223    <none>        443/TCP                      63s
monitoring   prometheus-k8s          ClusterIP   10.20.73.57     <none>        9090/TCP                     63s
monitoring   prometheus-operated     ClusterIP   None            <none>        9090/TCP                     62s
monitoring   prometheus-operator     ClusterIP   None            <none>        8443/TCP                     30m
...

// List Deployment controllers
]# kubectl get deployment -A
NAMESPACE    NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
monitoring   blackbox-exporter     1/1     1            1           109s
monitoring   grafana               1/1     1            1           108s
monitoring   kube-state-metrics    1/1     1            1           108s
monitoring   prometheus-adapter    2/2     2            2           108s
monitoring   prometheus-operator   1/1     1            1           30m
...

// List StatefulSet controllers
]# kubectl get sts -A
NAMESPACE    NAME                READY   AGE
monitoring   alertmanager-main   3/3     2m1s
monitoring   prometheus-k8s      2/2     119s
...
```
3. Access the services
- Access Prometheus:
- http://10.1.1.11:19090/

```shell
// Listen on 10.1.1.11:19090 and forward requests to port 9090 of the pods behind the service
]# kubectl port-forward svc/prometheus-k8s --address=10.1.1.11 19090:9090 -n monitoring
```
- Access Grafana:
- http://10.1.1.11:13000/ (admin:admin)

```shell
// Listen on 10.1.1.11:13000 and forward requests to port 3000 of the pods behind the service
]# kubectl port-forward svc/grafana --address=10.1.1.11 13000:3000 -n monitoring
```
- Access Alertmanager:
- http://10.1.1.11:19093/

```shell
// Listen on 10.1.1.11:19093 and forward requests to port 9093 of the pods behind the service
]# kubectl port-forward svc/alertmanager-main --address=10.1.1.11 19093:9093 -n monitoring
```
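Each local port above is simply the remote port with a `1` prefixed (19090, 13000, 19093). A hypothetical helper that builds the same command from that convention (the function name and the fixed address are illustrative):

```shell
# Build a kubectl port-forward command following the "1 + remote port"
# local-port convention used above.
pf_cmd() {  # usage: pf_cmd <service> <remote-port>
  printf 'kubectl port-forward svc/%s --address=10.1.1.11 1%s:%s -n monitoring\n' "$1" "$2" "$2"
}

pf_cmd prometheus-k8s 9090
# -> kubectl port-forward svc/prometheus-k8s --address=10.1.1.11 19090:9090 -n monitoring
```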
4. Create Ingress rules

```yaml
]# vim prometheus-ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: prometheus.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
  - host: alertmanager.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
  - host: grafana.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
```
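Before applying, the manifest can be sanity-checked locally without a cluster; for instance, listing every virtual host it routes. The heredoc below is an abbreviated stand-in for the real file, and the temp path is illustrative:

```shell
# Abbreviated excerpt of the Ingress manifest
cat > /tmp/ingress-excerpt.yaml <<'EOF'
spec:
  rules:
  - host: prometheus.local
  - host: alertmanager.local
  - host: grafana.local
EOF

# Print the third whitespace-separated field of every "- host:" line.
awk '/- host:/ {print $3}' /tmp/ingress-excerpt.yaml
# -> prometheus.local
#    alertmanager.local
#    grafana.local
```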