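II. Installing and Configuring the Grafana Visualization UI
2.1 Introduction to Grafana
Grafana is a cross-platform, open-source tool for metrics analytics and visualization. It renders collected data as dashboards and notifies alert recipients promptly. Its main features are:
1. Visualization: fast, flexible client-side graphs. Panel plugins offer many ways to visualize metrics and logs, and the official library has a rich set of dashboard plugins, such as heatmaps, line charts, and graphs.
2. Data sources: Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, KairosDB, and others.
3. Notifications: define alert rules visually for your most important metrics. Grafana evaluates them continuously and sends notifications through Slack, PagerDuty, and similar channels when data reaches a threshold.
4. Mixed display: mix different data sources in the same graph. The data source can be specified per query, and custom data sources are supported.
5. Annotations: annotate graphs with rich events from different data sources. Hovering over an event shows its full metadata and tags.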
2.2 Installing Grafana
# Image required by Grafana: heapster-grafana-amd64_v5_0_4.tar.gz
# Docker Hub: https://hub.docker.com/r/grafana/grafana
[root@monitor ~]# docker load -i heapster-grafana-amd64_v5_0_4.tar.gz
Loaded image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
[root@master prometheus]# kubectl apply -f grafana.yaml
deployment.apps/monitoring-grafana created
service/monitoring-grafana created
[root@master prometheus]# kubectl get pods -n kube-system -l task=monitoring -o wide
monitoring-grafana-675798bf47-bjlhj 1/1 Running 0 43s 10.244.75.194 monitor <none> <none>
[root@master prometheus]# kubectl get svc -n kube-system
NAME                 TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
monitoring-grafana   NodePort   10.105.217.150   <none>        80:30510/TCP   2m50s
Access Grafana at: http://192.168.199.131:30510/
Contents of grafana.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: INFLUXDB_HOST
          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    # For use as a Cluster add-on (https://github.com/kubernetes/kubernetes/tree/master/cluster/addons)
    # If you are NOT using this as an addon, you should comment out this line.
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  # In a production setup, we recommend accessing Grafana through an external Loadbalancer
  # or through a public IP.
  # type: LoadBalancer
  # You could also use NodePort to expose the service at a randomly-generated port
  # type: NodePort
  ports:
  - port: 80
    targetPort: 3000
  selector:
    k8s-app: grafana
  type: NodePort
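Once the NodePort service is up, Grafana's health endpoint (/api/health) gives a quick answer on whether the server is responding before you open the UI. The following is a minimal sketch, assuming curl is installed and that the node IP and NodePort shown above are still valid. The function names are hypothetical helpers, not part of Grafana.

```shell
#!/usr/bin/env bash
# Build the URL of Grafana's health endpoint for a given host and port.
grafana_health_url() {
  local host="$1" port="$2"
  printf 'http://%s:%s/api/health\n' "$host" "$port"
}

# Query the endpoint; curl -f turns HTTP errors into a non-zero exit status.
check_grafana() {
  curl -fsS --max-time 5 "$(grafana_health_url "$1" "$2")"
}

# Usage (not run automatically):
#   check_grafana 192.168.199.131 30510
```

If the call fails, kubectl describe and kubectl logs on the monitoring-grafana Pod are the next places to look.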
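2.3 Visualizing node resource metrics in Grafana
Add Prometheus as a data source and save. A green "Data source is working" message means it is configured correctly.
Import a monitoring dashboard template: https://grafana.com/dashboards?dataSource=prometheus&search=kubernetes
https://grafana.com/grafana/dashboards/
Import a local JSON file, in this case node-exporter.json.
If a panel shows N/A (no data), edit the panel, copy the query it displays, and run it in Prometheus to check whether it returns data.
2.4 Visualizing Docker container monitoring data in Grafana
Download the template from the official site: https://grafana.com/grafana/dashboards/179
Import the docker_rev1.json dashboard template. Many templates require a specific Grafana version, so check compatibility before downloading.
2.5 Installing the kube-state-metrics component
What is kube-state-metrics?
kube-state-metrics listens to the API Server and generates state metrics for resource objects such as Deployments, Nodes, and Pods. Note that kube-state-metrics only exposes metrics; it does not store them,
so we use Prometheus to scrape and store the data. It focuses on business-related metadata such as Deployment, Pod, and replica status: How many replicas were scheduled? How many are available now? How many Pods are running/stopped/terminated?
How many times has a Pod restarted? How many jobs are running?
Installing the kube-state-metrics component
1) Create a ServiceAccount and grant it permissions
On the k8s control-plane node, create a kube-state-metrics-rbac.yaml file and apply the manifest with kubectl apply: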
[root@master prometheus]# kubectl apply -f kube-state-metrics-rbac.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@master prometheus]# kubectl get sa -n kube-system | grep kube-state-metrics
kube-state-metrics 1 31s
Contents of kube-state-metrics-rbac.yaml:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
2) Install the kube-state-metrics component
[root@monitor ~]# docker load -i kube-state-metrics_1_9_0.tar.gz
Loaded image: quay.io/coreos/kube-state-metrics:v1.9.0
[root@master prometheus]# kubectl apply -f kube-state-metrics-deploy.yaml
deployment.apps/kube-state-metrics created
[root@master prometheus]# kubectl get pods -n kube-system -o wide -l app=kube-state-metrics
NAME                                  READY   STATUS    RESTARTS   AGE     IP              NODE      NOMINATED NODE   READINESS GATES
kube-state-metrics-58d4957bc5-f9m9m   1/1     Running   0          3m57s   10.244.75.195   monitor   <none>           <none>
3) Create the Service
[root@master prometheus]# kubectl apply -f kube-state-metrics-svc.yaml
service/kube-state-metrics created
[root@master prometheus]# kubectl get svc -n kube-system -l app=kube-state-metrics
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
kube-state-metrics   ClusterIP   10.100.13.94   <none>        8080/TCP   75s
Contents of kube-state-metrics-svc.yaml:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'   # lets Prometheus discover and scrape this Service
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
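The questions kube-state-metrics is meant to answer (replicas scheduled, replicas available, Pod restarts) map onto a few of its metrics. The sketch below queries them through the Prometheus HTTP API. It assumes curl is installed and that Prometheus is reachable at the NodePort address used elsewhere in this document (http://192.168.199.131:30051). The helper names and the choice of queries are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Run one instant query against the Prometheus HTTP API (/api/v1/query).
# -G sends the URL-encoded expression as a GET query string.
prom_query() {
  local base="$1" expr="$2"
  curl -fsS -G "${base}/api/v1/query" --data-urlencode "query=${expr}"
}

# Example PromQL expressions built on kube-state-metrics metrics.
ksm_example_queries() {
  echo 'sum(kube_deployment_spec_replicas)'
  echo 'sum(kube_deployment_status_replicas_available)'
  echo 'sum(kube_pod_container_status_restarts_total) by (namespace, pod)'
}

# Usage (not run automatically):
#   ksm_example_queries | while read -r q; do
#     prom_query http://192.168.199.131:30051 "$q"; echo
#   done
```

An empty result list for all three usually means the Service has not been scraped yet or its annotation is not being picked up.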
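In the Prometheus UI, many new metrics prefixed with kube_ now appear.
Visualization: import the template Kubernetes Cluster (Prometheus)-1577674936972.json into Grafana. It lists Pod status and resource usage.
If the disk panels above show N/A because no data was collected, edit the panel and check whether the query is wrong: changing node_filesystem_size to node_filesystem_size_bytes fixes it.
Import the template Kubernetes cluster monitoring (via Prometheus) (k8s 1.16)-1577691996738.json. This one is more complete: it lists Pod resource metrics and resource usage for the cluster.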
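When a panel shows N/A because a metric was renamed, listing the metric names Prometheus actually knows is often quicker than guessing. This sketch filters the output of the /api/v1/label/__name__/values endpoint by prefix. It assumes curl is installed and Prometheus is at http://192.168.199.131:30051. The helper names are hypothetical, and the grep-based JSON filtering is a deliberate simplification rather than a real JSON parser.

```shell
#!/usr/bin/env bash
# Read the JSON returned by /api/v1/label/__name__/values on stdin and
# print the metric names that start with the given prefix, one per line.
filter_metric_names() {
  local prefix="$1"
  grep -o "\"${prefix}[A-Za-z0-9_:]*\"" | tr -d '"' | sort -u
}

# Fetch all metric names from Prometheus and filter them by prefix.
list_metric_names() {
  local base="$1" prefix="$2"
  curl -fsS "${base}/api/v1/label/__name__/values" | filter_metric_names "$prefix"
}

# Usage (not run automatically):
#   list_metric_names http://192.168.199.131:30051 node_filesystem_size
```

If only node_filesystem_size_bytes comes back, the dashboard query needs the _bytes suffix.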
III. Alertmanager
3.1 Alertmanager: sending alerts to a QQ mailbox
Alerting: Prometheus sends the abnormal events it detects to Alertmanager.
Notification: Alertmanager sends the alert to email, WeChat, DingTalk, and so on.
# Create the Alertmanager configuration file
# First enable the IMAP/SMTP and POP3/SMTP services on the 163 mailbox
[root@master prometheus]# cat alertmanager-cm.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'        # SMTP server address and port of the 163 mailbox
      smtp_from: 'yunwei@163.com'              # mailbox that sends the alerts
      smtp_auth_username: 'yunwei@163.com'     # SMTP authentication user for the sending mailbox
      smtp_auth_password: 'ZQEHQIQGD'          # authorization code of the sending mailbox, not the login password
      smtp_require_tls: false
    route:                                     # alert routing policy
      group_by: [alertname]                    # label used to group alerts
      group_wait: 10s                          # wait 10s after an alert arrives so alerts in the same group are sent together
      group_interval: 10s                      # wait before notifying about new alerts in a group that was already notified
      repeat_interval: 10m                     # wait before re-sending a notification that is still firing; raise it to reduce duplicate emails (Alertmanager's default is 4h)
      receiver: default-receiver               # who receives the alerts
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '78027@qq.com'                     # recipient mailbox
        send_resolved: true
[root@master prometheus]# kubectl apply -f alertmanager-cm.yaml
configmap/alertmanager created
[root@master prometheus]# kubectl get configmap -n monitor-sa
NAME           DATA   AGE
alertmanager   1      22s
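3.2 How a Prometheus alert is triggered, and the waiting times involved
The alert handling flow is as follows:
(1) The Prometheus server monitors the HTTP endpoint exposed by the target host (call it endpoint A) and scrapes monitoring data from it at the interval defined by 'scrape_interval' in the Prometheus configuration.
(2) When endpoint A becomes unavailable, the server keeps trying to fetch data until "scrape_timeout" elapses, then stops trying and marks the endpoint as "DOWN".
(3) Prometheus also evaluates alert rules periodically, at the interval set by "evaluation_interval" (default 1m). When an evaluation cycle finds that endpoint A is DOWN, that is, up == 0 is true, the alert is activated, enters the "PENDING" state, and its active time is recorded.
(4) At the next evaluation cycle, if up == 0 is still true, Prometheus checks whether the alert has been active longer than the rule's 'for' duration. If not, it waits for the next cycle; if so, the alert becomes "FIRING" and Prometheus calls the Alertmanager API to send the alert data.
(5) After receiving the alert, Alertmanager groups it and waits for the configured "group_wait" time before sending the notification.
(6) New alerts may join the same alert group while it is waiting. If a notification for the group has already been sent, the next notification goes out after "group_interval". With email alerting, for example, alerts in the same group are combined into one email.
(7) If the alerts in a group have not changed and were sent successfully, the same notification is re-sent after 'repeat_interval'. If the earlier notification was not sent successfully, this is equivalent to case (6), and it is retried after group_interval.
Finally, who receives an alert, under which conditions a receiver is selected, and how often different alerts are sent are all configured through Alertmanager's route rules.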
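The waiting times in steps (1) to (5) add up, so it helps to estimate how long a failure can go unreported. The sketch below uses a simplified model that is an assumption, not an official formula: worst case ≈ scrape_interval + evaluation_interval + (the 'for' duration rounded up to whole evaluation cycles) + group_wait. It ignores scrape_timeout, network latency, and notification delivery time. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Integer ceiling division: ceil(a / b) for a >= 0 and b > 0.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Rough worst-case delay in seconds from failure to first notification.
# Arguments: scrape_interval evaluation_interval for_duration group_wait
worst_case_delay() {
  local scrape="$1" eval_iv="$2" for_d="$3" group_wait="$4"
  # An alert can only move from PENDING to FIRING on an evaluation cycle,
  # so the 'for' duration is rounded up to whole evaluation intervals.
  local pending=$(( $(ceil_div "$for_d" "$eval_iv") * eval_iv ))
  echo $(( scrape + eval_iv + pending + group_wait ))
}

# Usage with values used in this document (15s, 1m, for: 2s, group_wait 10s):
#   worst_case_delay 15 60 2 10
```

Under this model, a 2s 'for' clause still costs a full evaluation cycle, because the rule is only re-checked once per evaluation_interval.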
3.3 Installing Prometheus with the alert rule configuration (remember to change the node IPs of each component)
Alert rule collections on GitHub:
https://github.com/samber/awesome-prometheus-alerts
https://awesome-prometheus-alerts.grep.to/alertmanager
https://awesome-prometheus-alerts.grep.to/rules#netdata
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.199.131:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.199.131:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.199.131:10249','192.168.199.128:10249']
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.199.131:2379']
rules.yml: | groups: - name: example rules: - alert: kube-proxy的cpu使用率大于80% expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%" - alert: kube-proxy的cpu使用率大于90% expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%" - alert: scheduler的cpu使用率大于80% expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%" - alert: scheduler的cpu使用率大于90% expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%" - alert: controller-manager的cpu使用率大于80% expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%" - alert: controller-manager的cpu使用率大于90% expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 0 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%" - alert: apiserver的cpu使用率大于80% expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%" - alert: apiserver的cpu使用率大于90% expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%" - alert: etcd的cpu使用率大于80% expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80 
for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过80%" - alert: etcd的cpu使用率大于90% expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}组件的cpu使用率超过90%" - alert: kube-state-metrics的cpu使用率大于80% expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%" value: "{{ $value }}%" threshold: "80%" - alert: kube-state-metrics的cpu使用率大于90% expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 0 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%" value: "{{ $value }}%" threshold: "90%" - alert: coredns的cpu使用率大于80% expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过80%" value: "{{ $value }}%" threshold: "80%" - alert: coredns的cpu使用率大于90% expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.k8s_app}}组件的cpu使用率超过90%" value: "{{ $value }}%" threshold: "90%" - alert: kube-proxy打开句柄数>600 expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600" value: "{{ $value }}" - alert: kube-proxy打开句柄数>1000 expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000" value: "{{ $value }}" - alert: kubernetes-schedule打开句柄数>600 expr: process_open_fds{job=~"kubernetes-schedule"} > 600 for: 2s labels: severity: warnning annotations: 
description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600" value: "{{ $value }}" - alert: kubernetes-schedule打开句柄数>1000 expr: process_open_fds{job=~"kubernetes-schedule"} > 1000 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000" value: "{{ $value }}" - alert: kubernetes-controller-manager打开句柄数>600 expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600" value: "{{ $value }}" - alert: kubernetes-controller-manager打开句柄数>1000 expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000" value: "{{ $value }}" - alert: kubernetes-apiserver打开句柄数>600 expr: process_open_fds{job=~"kubernetes-apiserver"} > 600 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600" value: "{{ $value }}" - alert: kubernetes-apiserver打开句柄数>1000 expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000" value: "{{ $value }}" - alert: kubernetes-etcd打开句柄数>600 expr: process_open_fds{job=~"kubernetes-etcd"} > 600 for: 2s labels: severity: warnning annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>600" value: "{{ $value }}" - alert: kubernetes-etcd打开句柄数>1000 expr: process_open_fds{job=~"kubernetes-etcd"} > 1000 for: 2s labels: severity: critical annotations: description: "{{$labels.instance}}的{{$labels.job}}打开句柄数>1000" value: "{{ $value }}" - alert: coredns expr: process_open_fds{k8s_app=~"kube-dns"} > 600 for: 2s labels: severity: warnning annotations: description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 打开句柄数超过600" value: "{{ $value }}" - alert: coredns expr: process_open_fds{k8s_app=~"kube-dns"} > 1000 for: 2s labels: 
severity: critical annotations: description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 打开句柄数超过1000" value: "{{ $value }}" - alert: kube-proxy expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 2000000000 for: 2s labels: severity: warnning annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G" value: "{{ $value }}" - alert: scheduler expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 2000000000 for: 2s labels: severity: warnning annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G" value: "{{ $value }}" - alert: kubernetes-controller-manager expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 2000000000 for: 2s labels: severity: warnning annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G" value: "{{ $value }}" - alert: kubernetes-apiserver expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 2000000000 for: 2s labels: severity: warnning annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G" value: "{{ $value }}" - alert: kubernetes-etcd expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"} > 2000000000 for: 2s labels: severity: warnning annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 使用虚拟内存超过2G" value: "{{ $value }}" - alert: kube-dns expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 2000000000 for: 2s labels: severity: warnning annotations: description: "插件{{$labels.k8s_app}}({{$labels.instance}}): 使用虚拟内存超过2G" value: "{{ $value }}" - alert: HttpRequestsAvg expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-control-manager|kubernetes-apiservers"}[1m])) > 1000 for: 2s labels: team: admin annotations: description: "组件{{$labels.job}}({{$labels.instance}}): TPS超过1000" value: "{{ $value }}" threshold: "1000" - alert: Pod_restarts expr: 
kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0 for: 2s labels: severity: warnning annotations: description: "在{{$labels.namespace}}名称空间下发现{{$labels.pod}}这个pod下的容器{{$labels.container}}被重启,这个监控指标是由{{$labels.instance}}采集的" value: "{{ $value }}" threshold: "0" - alert: Pod_waiting expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1 for: 2s labels: team: admin annotations: description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.pod}}下的{{$labels.container}}启动异常等待中" value: "{{ $value }}" threshold: "1" - alert: Pod_terminated expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1 for: 2s labels: team: admin annotations: description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.pod}}下的{{$labels.container}}被删除" value: "{{ $value }}" threshold: "1" - alert: Etcd_leader expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0 for: 2s labels: team: admin annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 当前没有leader" value: "{{ $value }}" threshold: "0" - alert: Etcd_leader_changes expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0 for: 2s labels: team: admin annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 当前leader已发生改变" value: "{{ $value }}" threshold: "0" - alert: Etcd_failed expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0 for: 2s labels: team: admin annotations: description: "组件{{$labels.job}}({{$labels.instance}}): 服务失败" value: "{{ $value }}" threshold: "0" - alert: Etcd_db_total_size expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000 for: 2s labels: team: admin annotations: description: "组件{{$labels.job}}({{$labels.instance}}):db空间超过10G" value: "{{ $value }}" threshold: "10G" - alert: Endpoint_ready expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1 for: 2s labels: 
team: admin annotations: description: "空间{{$labels.namespace}}({{$labels.instance}}): 发现{{$labels.endpoint}}不可用" value: "{{ $value }}" threshold: "1" - name: 物理节点状态-监控告警 rules: - alert: 物理节点cpu使用率 expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90 for: 2s labels: severity: ccritical annotations: summary: "{{ $labels.instance }}cpu使用率过高" description: "{{ $labels.instance }}的cpu使用率超过90%,当前使用率[{{ $value }}],需要排查处理" - alert: 物理节点内存使用率 expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90 for: 2s labels: severity: critical annotations: summary: "{{ $labels.instance }}内存使用率过高" description: "{{ $labels.instance }}的内存使用率超过90%,当前使用率[{{ $value }}],需要排查处理" - alert: InstanceDown expr: up == 0 for: 2s labels: severity: critical annotations: summary: "{{ $labels.instance }}: 服务器宕机" description: "{{ $labels.instance }}: 服务器延时超过2分钟" - alert: 物理节点磁盘的IO性能 expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60 for: 2s labels: severity: critical annotations: summary: "{{$labels.mountpoint}} 流入磁盘IO使用率过高!" description: "{{$labels.mountpoint }} 流入磁盘IO大于60%(目前使用:{{$value}})" - alert: 入网流量带宽 expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400 for: 2s labels: severity: critical annotations: summary: "{{$labels.mountpoint}} 流入网络带宽过高!" description: "{{$labels.mountpoint }}流入网络带宽持续5分钟高于100M. RX带宽使用率{{$value}}" - alert: 出网流量带宽 expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400 for: 2s labels: severity: critical annotations: summary: "{{$labels.mountpoint}} 流出网络带宽过高!" description: "{{$labels.mountpoint }}流出网络带宽持续5分钟高于100M. 
RX带宽使用率{{$value}}" - alert: TCP会话 expr: node_netstat_Tcp_CurrEstab > 1000 for: 2s labels: severity: critical annotations: summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!" description: "{{$labels.mountpoint }} TCP_ESTABLISHED大于1000%(目前使用:{{$value}}%)" - alert: 磁盘容量 expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80 for: 2s labels: severity: critical annotations: summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!" description: "{{$labels.mountpoint }} 磁盘分区使用大于80%(目前使用:{{$value}}%)"
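Rules deliberately altered to trigger alerts in prometheus-alertmanager-cfg.yaml:
1. Some rules meant to send an alert email above 90% usage were changed to fire above 0.
2. The network traffic rules use for: 2s rather than the 5 minutes stated in their descriptions.
- alert: controller-manager的cpu使用率大于90%
  expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 0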
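The rules above mostly carry severity: warnning, and one carries severity: ccritical. That is harmless while routing only groups by alertname, but it would silently miss any route that matches on severity. The sketch below lists severity values other than warning and critical. The expected value set and the helper name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Print every distinct severity value in a rules file that is neither
# "warning" nor "critical".
find_bad_severity() {
  local file="$1"
  grep -oE 'severity:[[:space:]]*[A-Za-z]+' "$file" \
    | sed -E 's/^severity:[[:space:]]*//' \
    | grep -vxE 'warning|critical' \
    | sort -u
}

# Usage (not run automatically), after extracting the rules from the ConfigMap:
#   kubectl get configmap prometheus-config -n monitor-sa \
#     -o jsonpath='{.data.rules\.yml}' > rules.yml
#   find_bad_severity rules.yml
```

promtool check rules rules.yml, shipped with Prometheus, validates rule syntax and expressions, but it does not flag misspelled label values, so a check like this complements it.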
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: monitor
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
      - name: alertmanager
        image: prom/alertmanager:v0.14.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime              # keep the container clock in sync with the host
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
      - name: k8s-certs
        secret:
          secretName: etcd-certs
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: alertmanager-storage
        hostPath:
          path: /data/alertmanager
          type: DirectoryOrCreate
      - name: localtime                # mount the time zone so the container matches the host
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
# prometheus-alertmanager-deploy.yaml (shown above) pins the Pod to a specific k8s worker node; change nodeName: monitor to the name of your own node.
# Load the Alertmanager image on that node
[root@monitor ~]# docker load -i alertmanager.tar.gz
Loaded image: prom/alertmanager:v0.14.0
Create the etcd-certs Secret, which the Prometheus deployment needs:
[root@master prometheus]# kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created
Apply the manifests with kubectl apply:
[root@master prometheus]# kubectl delete -f prometheus-cfg.yaml
configmap "prometheus-config" deleted
[root@master prometheus]# kubectl apply -f prometheus-alertmanager-cfg.yaml
configmap/prometheus-config created
[root@master prometheus]# kubectl get configmap -n monitor-sa
NAME                DATA   AGE
alertmanager        1      82m
kube-root-ca.crt    1      25h
prometheus-config   2      93s
[root@master prometheus]# kubectl delete -f prometheus-deploy.yaml
deployment.apps "prometheus-server" deleted
[root@master prometheus]# kubectl apply -f prometheus-alertmanager-deploy.yaml
deployment.apps/prometheus-server configured
[root@master prometheus]# kubectl get pods -n monitor-sa -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP                NODE      NOMINATED NODE   READINESS GATES
node-exporter-bg78z                  1/1     Running   0          24h   192.168.199.131   master    <none>           <none>
node-exporter-c6hlb                  1/1     Running   0          24h   192.168.199.128   monitor   <none>           <none>
prometheus-server-6d84cc9588-6nt5p   2/2     Running   0          64s   10.244.75.198     monitor   <none>           <none>
# Deploy a Service for Alertmanager so it can be reached from a browser
On the k8s control-plane node, create an alertmanager-svc.yaml file:
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
  name: alertmanager
  namespace: monitor-sa
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
[root@master prometheus]# kubectl apply -f alertmanager-svc.yaml
service/alertmanager created
[root@master prometheus]# kubectl get svc -n monitor-sa
NAME           TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
alertmanager   NodePort   10.98.60.180    <none>        9093:30066/TCP   10s
prometheus     NodePort   10.106.76.183   <none>        9090:30051/TCP   22h
Open Alertmanager in a browser: http://192.168.199.131:30066/
Prometheus: http://192.168.199.131:30051/targets now lists many more services.
In Prometheus, the kubernetes-controller-manager, kubernetes-kube-proxy, and kubernetes-schedule targets all report that they cannot connect to their ports.
Fix this as follows:
# kube-scheduler
vim /etc/kubernetes/manifests/kube-scheduler.yaml
Make these changes:
  change --bind-address=127.0.0.1 to --bind-address=192.168.199.131
  delete --port=0
# kube-controller-manager
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
Make these changes:
  change --bind-address=127.0.0.1 to --bind-address=192.168.199.131
  delete --port=0
# After the changes, run this on every k8s node
systemctl restart kubelet
[root@master prometheus]# kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
Prometheus can now scrape these targets.
The kubernetes-kube-proxy target still cannot be reached.
This is because kube-proxy's metrics port 10249 listens on 127.0.0.1 by default and must be changed to listen on the node's address. In production, make this change when installing k8s to reduce risk. (In some versions the bind-address setting has a different name, so check yours.)
[root@master prometheus]# kubectl edit configmap kube-proxy -n kube-system
configmap/kube-proxy edited
# Change metricsBindAddress: "" to metricsBindAddress: "0.0.0.0:10249"
# Restart kube-proxy
[root@master prometheus]# kubectl get pods -n kube-system | grep kube-proxy | awk '{print $1}' | xargs kubectl delete pods -n kube-system
pod "kube-proxy-4vqkw" deleted
pod "kube-proxy-n49vk" deleted
[root@master prometheus]# kubectl get pods -n kube-system | grep kube-proxy
kube-proxy-9bzl5 1/1 Running 0 35s
kube-proxy-srq8v 1/1 Running 0 32s
[root@master prometheus]# ss -antulp | grep 10249
tcp LISTEN 0 128 [::]:10249 [::]:* users:(("kube-proxy",pid=119544,fd=15))
# If a container fails to start, check its events and logs
kubectl describe pods kube-proxy-9bzl5 -n kube-system
kubectl logs kube-proxy-9bzl5 -n kube-system
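After changing the bind addresses, each component's metrics port can be probed directly from a machine that can reach the node, without waiting for the next Prometheus scrape. This sketch assumes curl is installed and uses the plain-HTTP ports from the scrape configuration in this document (10251, 10252, 10249). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the metrics URLs of the components scraped over plain HTTP on one node.
component_metrics_urls() {
  local node_ip="$1"
  printf 'http://%s:10251/metrics\n' "$node_ip"   # kube-scheduler
  printf 'http://%s:10252/metrics\n' "$node_ip"   # kube-controller-manager
  printf 'http://%s:10249/metrics\n' "$node_ip"   # kube-proxy
}

# Probe every URL and report OK or FAIL for each.
probe_components() {
  local url
  component_metrics_urls "$1" | while read -r url; do
    if curl -fsS --max-time 3 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}

# Usage (not run automatically):
#   probe_components 192.168.199.131
```

A FAIL line here points at the bind address or a firewall rather than at the Prometheus configuration.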
FIRING means Prometheus has already sent the alert to Alertmanager, and the alert is now visible in Alertmanager.
Open the Alertmanager web UI: http://192.168.199.131:30066/
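To test the notification path without waiting for a real rule to fire, an alert can be posted to Alertmanager by hand. This sketch targets the v1 HTTP API (/api/v1/alerts), which is the API exposed by the v0.14.0 image used in this document; newer releases use /api/v2/alerts instead. The alert name, labels, and helper names are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a one-alert JSON payload. Arguments must not contain double quotes.
test_alert_payload() {
  local name="$1" severity="$2"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"description":"manual test alert"}}]\n' \
    "$name" "$severity"
}

# POST the payload to Alertmanager's v1 alerts endpoint.
send_test_alert() {
  local base="$1"
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(test_alert_payload TestAlert warning)" \
    "${base}/api/v1/alerts"
}

# Usage (not run automatically):
#   send_test_alert http://192.168.199.131:30066
```

The alert should appear in the web UI and be delivered to the configured receiver once group_wait has passed.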
# Extra: force-updating the configuration, for cases where a hot reload does not take effect
After changing any Prometheus configuration file, recreate the resources with kubectl so the change takes effect. Run the commands in this order:
kubectl delete -f alertmanager-cm.yaml
kubectl apply -f alertmanager-cm.yaml
kubectl delete -f prometheus-alertmanager-cfg.yaml
kubectl apply -f prometheus-alertmanager-cfg.yaml
kubectl delete -f prometheus-alertmanager-deploy.yaml
kubectl apply -f prometheus-alertmanager-deploy.yaml
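Before recreating everything, it may be worth trying a hot reload. The Prometheus container in this document is started with --web.enable-lifecycle, which enables the POST /-/reload endpoint, and Alertmanager exposes an endpoint of the same name. The sketch below assumes curl is installed and uses the NodePort addresses from this document. The helper names are hypothetical. Keep in mind that an updated ConfigMap can take a little while to show up in the mounted volume, so a reload issued immediately after kubectl apply may still read the old file.

```shell
#!/usr/bin/env bash
# Build the reload URL from a base URL, tolerating one trailing slash.
reload_url() {
  printf '%s/-/reload\n' "${1%/}"
}

# Ask the server to re-read its configuration files.
reload_config() {
  curl -fsS -X POST "$(reload_url "$1")"
}

# Usage (not run automatically):
#   reload_config http://192.168.199.131:30051   # Prometheus
#   reload_config http://192.168.199.131:30066   # Alertmanager
```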
四、Alertmanager---发送报警到钉钉群
打开电脑版钉钉创建机器人
1.创建钉钉机器人 打开电脑版钉钉,创建一个群,创建自定义机器人,创建步骤: https://open.dingtalk.com/document/org/application-types https://open.dingtalk.com/document/group/group-robot https://open.dingtalk.com/document/group/custom-robot-access 我创建的机器人如下: 群设置-->智能群助手-->添加机器人-->自定义-->添加 机器人名称:test 接收群组:钉钉报警测试 安全设置: 自定义关键词:cluster1 上面配置好之后点击完成即可,这样就会创建一个 test 的报警机器人,创建机器人成功之后怎么查 看 webhook,按如下: 点击智能群助手,可以看到刚才创建的 test 这个机器人,点击 test,就会进入到 test 机器人的设 置界面 出现如下内容: 机器人名称:test 接受群组:钉钉报警测试 消息推送:开启 webhook: https://oapi.dingtalk.com/robot/send?access_token=8a53475677339a11cec453c608543c3d85ea73 b330ea70c4b2de96a0839cbb90 安全设置: 自定义关键词:cluster1 2.安装钉钉的 webhook 插件,在 k8s 的控制节点 xianchaomaster1 操作 tar zxvf prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz cd prometheus-webhook-dingtalk-0.3.0.linux-amd64 启动钉钉报警插件 nohup ./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="cluster1=https://oapi.dingtalk.com/robot/send?access_token=8a53475677339a11cec453c608543c3d85ea73b330ea70c4b2de96a0839cbb90" & 对原来的 alertmanager-cm.yaml 文件做备份 cp alertmanager-cm.yaml alertmanager-cm.yaml.bak 重新生成一个新的 alertmanager-cm.yaml 文件
[root@master prometheus]# cat alertmanager-cm-dd.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'test@163.com'
      smtp_auth_username: 'test@163.com'
      smtp_auth_password: 'ZQEHQINQHQBFE'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 10m
      receiver: cluster1
    receivers:
    - name: 'cluster1'
      webhook_configs:
      - url: 'http://192.168.199.131:8060/dingtalk/cluster1/send'
        send_resolved: true
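With webhook_configs, Alertmanager sends each alert group to the configured URL as an HTTP POST with a JSON body, and prometheus-webhook-dingtalk turns that body into a DingTalk message. The sketch below shows roughly what such a body looks like and how a receiver might read it. The sample alert (HighCpu, the description text) and the summarize helper are invented for illustration; only the field names follow the Alertmanager webhook format.

```python
import json

# Hypothetical sample body; field names follow the Alertmanager webhook format,
# the values are made up for this example.
SAMPLE = """{
  "version": "4",
  "status": "firing",
  "receiver": "cluster1",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighCpu", "instance": "master"},
      "annotations": {"description": "cpu above 80%"}
    }
  ]
}"""


def summarize(payload):
    """Return one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {} on {}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
            annotations.get("description", ""),
        ))
    return lines


for line in summarize(json.loads(SAMPLE)):
    print(line)
```

Because send_resolved is true in the configuration above, the same endpoint also receives a body whose alerts have status "resolved" once the condition clears.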
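After changing any Prometheus configuration file, delete and re-apply the resources in the following order so that the change takes effect. Note that the old ConfigMap is deleted and the new DingTalk one is applied in its place:
kubectl delete -f alertmanager-cm.yaml
kubectl apply -f alertmanager-cm-dd.yaml
kubectl delete -f prometheus-alertmanager-cfg.yaml
kubectl apply -f prometheus-alertmanager-cfg.yaml
kubectl delete -f prometheus-alertmanager-deploy.yaml
kubectl apply -f prometheus-alertmanager-deploy.yaml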
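If no message arrives in the group, it can help to test the robot on its own, without Alertmanager in the path. The sketch below builds a DingTalk text message and makes sure it contains the custom keyword (cluster1), since a robot secured by keyword rejects messages that lack it. The helper names build_text_message and send are illustrative, and the actual network call is left commented out so the example runs offline.

```python
import json
import urllib.request

KEYWORD = "cluster1"  # the custom keyword set in the robot's security settings


def build_text_message(content, keyword=KEYWORD):
    """Build the JSON body of a DingTalk 'text' message that contains the keyword."""
    if keyword not in content:
        content = "[{}] {}".format(keyword, content)
    payload = {"msgtype": "text", "text": {"content": content}}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def send(webhook_url, body):
    """POST the body to the robot webhook and return the raw response text."""
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


body = build_text_message("test alert from Prometheus")
print(body.decode("utf-8"))
# Uncomment and fill in your own token to actually send the message:
# print(send("https://oapi.dingtalk.com/robot/send?access_token=<your token>", body))
```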
5. Alertmanager: sending alerts to a WeCom (企业微信) group
1. Register with WeCom
Log in at https://work.weixin.qq.com/, go to Application Management, and create an application named wechat. Once it has been created, the page shows:
AgentId: 1000003
Secret: Ov5SWq_JqrolsOj6dD4Jg9qaMu1TTaDzVTCrXHcjlFs
2. Modify alertmanager-cm.yaml
[root@master prometheus]# cat alertmanager-cm-wx.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'yangxiongchun@163.com'
      smtp_auth_username: 'yangxiongchun@163.com'
      smtp_auth_password: 'ZQEHQINQGDMHQBFE'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 10m
      receiver: prometheus
    receivers:
    - name: 'prometheus'
      wechat_configs:
      - corp_id: wwa82df90a693abb15
        to_user: '@all'
        agent_id: 1000003
        api_secret: Ov5SWq_JqrolsOj6dD4Jg9qaMu1TTaDzVTCrXHcjlFs
Parameter notes:
api_secret: found in WeCom under "Enterprise Applications" --> "Custom Application" [Prometheus] --> "Secret". wechat is the name of the application created above.
corp_id: the enterprise ID, found under "My Enterprise" --> "CorpID" (at the bottom of the page).
agent_id: found in WeCom under "Enterprise Applications" --> "Custom Application" [Prometheus] --> "AgentId".
# The application created here is named wechat. The receiver named in route must match the name of an entry under receivers, which is prometheus in this configuration.
to_user: '@all' sends the alert to everyone.
3. Configure a custom alert template
cat template_wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Alerting program: node_exporter
Alert name: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Alert summary: {{ .Annotations.summary }}
Alert details: {{ .Annotations.description }}
========end==========
{{ end }}
{{ end }}
Force update:
kubectl delete -f alertmanager-cm.yaml
kubectl apply -f alertmanager-cm-wx.yaml
kubectl delete -f prometheus-alertmanager-cfg.yaml
kubectl apply -f prometheus-alertmanager-cfg.yaml
kubectl delete -f prometheus-alertmanager-deploy.yaml
kubectl apply -f prometheus-alertmanager-deploy.yaml
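6. Prometheus PromQL syntax
PromQL (Prometheus Query Language) is the expression language developed for Prometheus. It is highly expressive, ships with many built-in functions, and is used to filter and aggregate time series data.
6.1 Data types
A PromQL expression evaluates to one of the following types:
Instant vector: a set of time series, each with a single sample
Range vector: a set of time series, each with multiple samples covering a span of time
Scalar: a single floating-point number
String: a single string value, currently unused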
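These types are visible when a query is run through the Prometheus HTTP API, where the response carries a resultType of vector, matrix, scalar, or string. The sketch below parses three hand-written responses to show the difference in shape. The sample values and the helper samples_per_series are illustrative assumptions, not output captured from a real server.

```python
import json

# Hypothetical, hand-written responses in the shape returned by /api/v1/query.
VECTOR = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"job":"demo"},"value":[1700000000,"1"]}]}}')
MATRIX = ('{"status":"success","data":{"resultType":"matrix","result":'
          '[{"metric":{"job":"demo"},"values":[[1700000000,"1"],[1700000015,"3"]]}]}}')
SCALAR = '{"status":"success","data":{"resultType":"scalar","result":[1700000000,"42"]}}'


def samples_per_series(response_text):
    """Return (result type, number of samples in each returned series)."""
    data = json.loads(response_text)["data"]
    kind, result = data["resultType"], data["result"]
    if kind == "vector":
        # Instant vector: exactly one sample ("value") per series.
        return kind, [1 for _ in result]
    if kind == "matrix":
        # Range vector: a list of samples ("values") per series.
        return kind, [len(series["values"]) for series in result]
    # Scalar or string: a single [timestamp, value] pair and no series.
    return kind, []


for text in (VECTOR, MATRIX, SCALAR):
    print(samples_per_series(text))
```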
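6.1.1 Instant vector selectors
An instant vector selector selects the sample value of a set of time series at a single point in time.
The simplest form names only a metric, which selects the current sample of every time series belonging to that metric. For example: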
apiserver_request_total
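You can filter the series by appending a set of label key-value pairs in curly braces. The expression below selects the series whose job is kubernetes-apiserver (the job name is visible on the Targets page) and whose resource is pods: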
apiserver_request_total{job="kubernetes-apiserver",resource="pods"}
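Label values can be matched by equality or by regular expression. The matching operators are:
=: exactly equal
!=: not equal
=~: matches the regular expression
!~: does not match the regular expression
The expression below selects the series whose container is kube-scheduler, kube-proxy, or kube-apiserver: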
container_processes{container=~"kube-scheduler|kube-proxy|kube-apiserver"}
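One detail that is easy to miss is that regular expression matchers are fully anchored: the pattern has to match the whole label value, not just part of it. The sketch below imitates the four operators in Python to make that visible. It is only an approximation, since Prometheus uses RE2 syntax rather than Python's re module, and the function name matches is made up for this example.

```python
import re


def matches(value, op, pattern):
    """Imitate a PromQL label matcher against a single label value."""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # fullmatch mirrors the anchored behaviour of PromQL regex matchers.
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError("unknown operator: " + op)


pattern = "kube-scheduler|kube-proxy|kube-apiserver"
for container in ("kube-proxy", "kube-proxy-extra", "etcd"):
    print(container, matches(container, "=~", pattern))
```

To match on a prefix, the pattern needs an explicit wildcard, for example container=~"kube-.*".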
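6.1.2 Range vector selectors
A range vector selector is similar to an instant vector selector, except that it selects the samples from a span of time in the past. To build one, append a duration in square brackets [] to an instant vector selector.
For example, the expression below selects the samples from the past 1 minute for every series of the metric apiserver_request_total whose resource is pods.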
apiserver_request_total{job="kubernetes-apiserver",resource="pods"}[1m]
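A range vector cannot be drawn in the Graph tab. Switch to the Console tab to see the collected samples.
Note: the duration unit can be one of the following: s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)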
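To make the units concrete, the sketch below converts a single-unit duration such as 5m or 1w into seconds. In PromQL a day is always 24 hours, a week 7 days, and a year 365 days. The function name duration_seconds is illustrative, and the sketch deliberately ignores forms not listed above, such as milliseconds or combined durations.

```python
import re

# Seconds per unit; PromQL treats d as 24h, w as 7d and y as 365d.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}


def duration_seconds(text):
    """Convert a single-unit duration like '5m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError("unsupported duration: " + text)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


for text in ("1m", "5m", "1w"):
    print(text, duration_seconds(text))
```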
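6.1.3 Offset modifier
The selectors described so far all use the current time as their reference time. The offset modifier shifts that reference time back into the past. It is written immediately after the selector, with the keyword offset followed by the amount to shift.
For example, the expression below selects the sample values, as of 5 minutes ago, of the apiserver_request_total series that match the given labels.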
apiserver_request_total{job="kubernetes-apiserver",resource="pods"} offset 5m
The expression below selects the samples of apiserver_request_total over the 5 minutes that ended at this time one week ago.
apiserver_request_total{job="kubernetes-apiserver",resource="pods"} [5m] offset 1w
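The way a range and an offset combine can be worked out with plain arithmetic: the offset moves the end of the window back, and the range then extends backwards from that point. The sketch below computes the window in Unix seconds. The function name window and the example timestamp are made up, and whether the boundaries themselves are included depends on the Prometheus version, which is ignored here.

```python
def window(now, range_seconds, offset_seconds=0):
    """Return (start, end) of the time window a range selector covers."""
    end = now - offset_seconds      # offset shifts the reference time back
    start = end - range_seconds     # the range reaches back from there
    return start, end


NOW = 1_000_000                     # arbitrary example timestamp
FIVE_MINUTES = 5 * 60
ONE_WEEK = 7 * 24 * 60 * 60

print(window(NOW, FIVE_MINUTES))             # [5m]
print(window(NOW, FIVE_MINUTES, ONE_WEEK))   # [5m] offset 1w
```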
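6.1.4 Aggregation operators
PromQL aggregation operators combine the elements of a vector into a smaller number of elements. The available operators are:
sum: sum
min: minimum
max: maximum
avg: average
stddev: standard deviation
stdvar: variance
count: number of elements
count_values: number of elements that share the same value
bottomk: the k smallest elements
topk: the k largest elements
quantile: quantile
For example, to compute the total memory used by all containers on the master node, in GiB: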
sum(container_memory_usage_bytes{instance=~"master"})/1024/1024/1024
To compute the CPU usage, as a percentage, of all containers on the master node over the last 1m:
sum (rate (container_cpu_usage_seconds_total{instance=~"master"}[1m])) / sum (machine_cpu_cores{ instance =~"master"}) * 100
To compute the CPU usage of each container over the last 1m, grouped by id:
sum (rate (container_cpu_usage_seconds_total{id!="/"}[1m])) by (id)
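The by (id) clause is what keeps one result per container instead of collapsing everything into a single number. The sketch below reproduces that grouping on a few made-up samples, each a pair of labels and a value. The function name sum_by and the sample data are assumptions for illustration only.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values, keeping one total per distinct value of the label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-CPU rates for two containers.
samples = [
    ({"id": "/a", "cpu": "0"}, 0.25),
    ({"id": "/a", "cpu": "1"}, 0.5),
    ({"id": "/b", "cpu": "0"}, 0.125),
]

print(sum_by(samples, "id"))                   # like: sum(...) by (id)
print(sum(value for _, value in samples))      # like: sum(...)
```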
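6.1.5 Functions
Prometheus has a number of built-in functions to help with calculations. Some typical ones are:
abs(): absolute value
sqrt(): square root
exp(): exponential function
ln(): natural logarithm
ceil(): round up to the nearest integer
floor(): round down to the nearest integer
round(): round to the nearest integer
delta(): the difference between the first and last value of each series in a range vector
sort(): sort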
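Two of these behave slightly differently from what a programmer might expect. PromQL's round() resolves ties by rounding up, whereas Python's built-in round() rounds ties to the nearest even number, and delta() in Prometheus extrapolates to the edges of the time range rather than simply subtracting. The sketch below shows the tie-breaking rule and a simplified delta without extrapolation. The names promql_round and simple_delta are made up for this example.

```python
import math


def promql_round(x):
    """Round to the nearest integer, resolving ties upwards as PromQL does."""
    return math.floor(x + 0.5)


def simple_delta(values):
    """Last value minus first value; Prometheus additionally extrapolates."""
    return values[-1] - values[0]


print(promql_round(2.5), round(2.5))       # ties: up versus to-even
print(math.ceil(1.2), math.floor(-1.2))    # ceil() and floor()
print(simple_delta([10.0, 12.0, 15.5]))    # simplified delta()
```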