A Comprehensive Monitoring Platform Based on Prometheus: Which Alerting Rules Does an Enterprise Need?
1. Introduction
Alerting rules in Prometheus let you define alert trigger conditions as PromQL expressions. The Prometheus server evaluates these rules periodically, and when a condition is met, an alert notification is fired.
In an enterprise, well-designed Prometheus alerting rules are essential for keeping the business stable and reliable. The following dimensions are worth considering:
- Business dimension: different lines of business have different metrics and alerting rules. For a ToC platform, for example, you need to monitor order volume, inventory, payment success rate, and similar metrics to make sure the business is running normally.
- Environment dimension: an enterprise usually runs multiple environments, such as development, testing, staging, and production. Because each environment has different characteristics, each needs its own alerting rules.
- Application dimension: different applications have different metrics and alerting rules. When monitoring a web application, for example, you need to watch the HTTP request failure rate, response time, and memory usage.
- Infrastructure dimension: enterprise infrastructure includes servers, network devices, storage devices, and so on. When monitoring infrastructure, you need to watch CPU utilization, disk space, network bandwidth, and similar metrics.
- Metric reference: https://v2-1.docs.kubesphere.io/docs/zh-CN/api-reference/monitoring-metrics/
2. Defining Alerting Rules
A typical alerting rule looks like this:
```yaml
groups:
  - name: general.rules
    rules:
      - alert: InstanceDown
        expr: |
          up{job=~"other-ECS|k8s-nodes|prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} has stopped working"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has been down for more than 1 minute."
```
In a rule file, a set of related rules can be defined under a single group.
Each group can contain multiple alerting rules (rule). An alerting rule consists of the following parts:
- alert: the name of the alerting rule.
- expr: the trigger condition as a PromQL expression, used to evaluate whether any time series satisfies the condition.
- for: the evaluation wait time (optional). The alert is only sent once the condition has held for this duration; while waiting, newly triggered alerts are in the pending state.
- labels: custom labels, allowing the user to attach an extra set of labels to the alert.
- annotations: a set of additional information, such as text describing the alert in detail; annotation contents are sent along with the alert to Alertmanager when it fires.
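Before shipping a rule file, it is worth validating and unit-testing it with promtool. As a minimal sketch — assuming the group above is saved as general.rules.yml, and alert_test.yml is a file name of our choosing — you can check syntax with `promtool check rules general.rules.yml` and then replay synthetic series through the InstanceDown rule with `promtool test rules alert_test.yml`:

```yaml
# alert_test.yml -- a minimal unit test for the InstanceDown rule above
# (general.rules.yml is an assumed file name holding that group)
rule_files:
  - general.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # the target reports up=1 once, then flatlines at up=0
      - series: 'up{job="prometheus", instance="localhost:9090", hostname="prom-01"}'
        values: '1 0 0 0'
    alert_rule_test:
      - eval_time: 3m        # condition has held longer than for: 1m, so it must be firing
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: prometheus
              instance: localhost:9090
              hostname: prom-01
            exp_annotations:
              summary: "Instance localhost:9090 has stopped working"
              description: "localhost:9090 hostname: prom-01 has been down for more than 1 minute."
```

`promtool check rules` only validates syntax; the unit test actually exercises the expression, the `for` duration, and the rendered annotations.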
3. Alerting Rules in the Enterprise
Pick rules that fit your company's business scenarios; a good reference is Awesome Prometheus alerts | Collection of alerting rules (samber.github.io).
3.1 node.rules
```yaml
groups:
  - name: node.rules
    rules:
      - alert: NodeFilesystemUsage
        expr: |
          100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage too high"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} : partition {{ $labels.mountpoint }} usage is above 85% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: |
          100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} memory usage is above 85% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 85
        for: 10m
        labels:
          hostname: '{{ $labels.hostname }}'
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage too high"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} CPU usage is above 85% (current value: {{ $value }})"
      - alert: TCP_Estab
        expr: |
          node_netstat_Tcp_CurrEstab > 5500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has too many established TCP connections"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has too many established TCP connections! (current value: {{ $value }})"
      - alert: TCP_TIME_WAIT
        expr: |
          node_sockstat_TCP_tw > 3000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has too many TCP TIME_WAIT sockets"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has too many TCP TIME_WAIT sockets! (current value: {{ $value }})"
      - alert: TCP_Sockets
        expr: |
          node_sockstat_sockets_used > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has too many TCP sockets in use"
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has too many TCP sockets in use! (current value: {{ $value }})"
      - alert: KubeNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.node }} has been NotReady for 1 minute.'
      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes memory pressure (instance {{ $labels.instance }})
          description: "{{ $labels.node }} has MemoryPressure condition VALUE = {{ $value }}"
      - alert: KubernetesDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes disk pressure (instance {{ $labels.instance }})
          description: "{{ $labels.node }} has DiskPressure condition."
      - alert: KubernetesContainerOomKiller
        expr: |
          (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1)
          and ignoring (reason)
          min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes container oom killer (instance {{ $labels.instance }})
          description: "{{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes."
      - alert: KubernetesJobFailed
        expr: kube_job_status_failed > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes Job failed (instance {{ $labels.instance }})
          description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete."
      - alert: UnusualDiskReadRate
        expr: |
          sum by (job,instance) (irate(node_disk_read_bytes_total[5m])) / 1024 / 1024 > 140
        for: 5m
        labels:
          severity: critical
          hostname: '{{ $labels.hostname }}'
        annotations:
          description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} has sustained disk reads above 140 MB/s for 5 minutes (current value: {{ $value }}). Alibaba Cloud ESSD max throughput: PL0 180 MB/s, PL1 350 MB/s'
      - alert: UnusualDiskWriteRate
        expr: |
          sum by (job,instance) (irate(node_disk_written_bytes_total[5m])) / 1024 / 1024 > 140
        for: 5m
        labels:
          severity: critical
          hostname: '{{ $labels.hostname }}'
        annotations:
          description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} has sustained disk writes above 140 MB/s for 5 minutes (current value: {{ $value }}). Alibaba Cloud ESSD max throughput: PL0 180 MB/s, PL1 350 MB/s'
      - alert: UnusualNetworkThroughputIn
        expr: |
          sum by (job,instance) (irate(node_network_receive_bytes_total{job=~"aws-hk-monitor|k8s-nodes"}[5m])) / 1024 / 1024 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} has received network traffic above 80 MB/s for 5 minutes (current value: {{ $value }})'
      - alert: UnusualNetworkThroughputOut
        expr: |
          sum by (job,instance) (irate(node_network_transmit_bytes_total{job=~"aws-hk-monitor|k8s-nodes"}[5m])) / 1024 / 1024 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} hostname: {{ $labels.hostname }} has transmitted network traffic above 80 MB/s for 5 minutes (current value: {{ $value }})'
      - alert: SystemdServiceCrashed
        expr: |
          node_systemd_unit_state{state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'The {{ $labels.name }} service on {{ $labels.instance }} hostname: {{ $labels.hostname }} has been failing for 5 minutes, please handle it promptly'
      - alert: HostDiskWillFillIn24Hours
        expr: |
          (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
          and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0
          and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} is expected to run out of filesystem space within 24 hours at the current write rate!"
      - alert: HostOutOfInodes
        expr: |
          node_filesystem_files_free / node_filesystem_files * 100 < 10
          and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of inodes (instance {{ $labels.instance }})
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} has less than 10% of inodes left!\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: HostOomKillDetected
        expr: increase(node_vmstat_oom_kill[1m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Host OOM kill detected (instance {{ $labels.instance }})
          description: "{{ $labels.instance }} hostname: {{ $labels.hostname }} an OOM kill was detected on this host!"
```
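When the same expensive expression is evaluated by several rules and dashboards — the CPU calculation in NodeCPUUsage is a typical case — it can be precomputed with a recording rule. A minimal sketch, where the record name instance:node_cpu_utilisation:rate5m is an assumed naming convention rather than anything from the rules above:

```yaml
groups:
  - name: node.recording.rules
    rules:
      # precompute per-instance CPU utilisation once per evaluation;
      # alert rules and dashboards can then query the cheap recorded series
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

NodeCPUUsage could then simply use `instance:node_cpu_utilisation:rate5m > 85` as its expression.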
3.2 prometheus.rules
```yaml
groups:
  - name: prometheus.rules
    rules:
      - alert: PrometheusErrorSendingAlertsToAnyAlertmanagers
        expr: |
          (rate(prometheus_notifications_errors_total{instance="localhost:9090", job="prometheus"}[5m]) / rate(prometheus_notifications_sent_total{instance="localhost:9090", job="prometheus"}[5m])) * 100 > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to any Alertmanager.'
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: |
          max_over_time(prometheus_notifications_alertmanagers_discovered{instance="localhost:9090", job="prometheus"}[5m]) != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Prometheus {{ $labels.namespace }}/{{ $labels.pod }} cannot connect to Alertmanager!"
      - alert: PrometheusRuleFailures
        expr: |
          increase(prometheus_rule_evaluation_failures_total{instance="localhost:9090", job="prometheus"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'Prometheus {{ $labels.namespace }}/{{ $labels.pod }} failed to evaluate {{ printf "%.0f" $value }} rules in the last 5 minutes'
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
          description: "Prometheus encountered {{ $value }} rule evaluation failures, please check promptly."
      - alert: PrometheusTsdbReloadFailures
        expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
          description: "Prometheus had {{ $value }} TSDB reload failures!"
      - alert: PrometheusTsdbWalCorruptions
        expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
          description: "Prometheus detected {{ $value }} TSDB WAL corruptions!"
```
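None of these groups take effect until Prometheus loads them. A minimal sketch of the wiring in prometheus.yml, where the /etc/prometheus/rules/ directory is an assumed layout:

```yaml
# prometheus.yml (fragment) -- the rules directory path is an assumption
rule_files:
  - /etc/prometheus/rules/*.yml   # node.rules, prometheus.rules, website.rules, ...
```

After editing rule files, Prometheus can reload them without a restart: send it a SIGHUP, or run `curl -X POST http://localhost:9090/-/reload` if the server was started with --web.enable-lifecycle.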
3.3 website.rules
```yaml
groups:
  - name: website.rules
    rules:
      # the original alert name contained spaces and non-ASCII characters,
      # which is not a valid Prometheus alert name; renamed accordingly
      - alert: SslCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiry warning"
          description: 'The certificate for domain {{ $labels.instance }} expires in {{ printf "%.1f" $value }} days, please renew it as soon as possible'
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
          pod: '{{ $labels.instance }}'
          namespace: '{{ $labels.kubernetes_namespace }}'
        annotations:
          summary: "Endpoint/host/port/domain {{ $labels.instance }} is unreachable"
          description: "Endpoint/host/port/domain {{ $labels.instance }} is unreachable, please investigate promptly!"
      - alert: curlHttpStatus
        expr: |
          probe_http_status_code{job="blackbox-http"} >= 422 and probe_success{job="blackbox-http"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Business alert: website unreachable'
          description: '{{ $labels.instance }} is unreachable, please check promptly; the current status code is {{ $value }}'
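These rules rely on metrics produced by the blackbox_exporter (probe_success, probe_http_status_code, probe_ssl_earliest_cert_expiry). A minimal sketch of the matching scrape job — the module name, probe target, and exporter address below are assumptions; only the job name blackbox-http is taken from the rules above:

```yaml
# prometheus.yml fragment -- standard blackbox_exporter probing pattern
scrape_configs:
  - job_name: blackbox-http          # matches the job selector in curlHttpStatus
    metrics_path: /probe
    params:
      module: [http_2xx]             # assumed module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com      # hypothetical site to probe
    relabel_configs:
      # pass the target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed URL as the instance label for alert templates
      - source_labels: [__param_target]
        target_label: instance
      # actually scrape the exporter, not the probed site
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter address
```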
3.4 volume.rules
```yaml
groups:
  - name: volume.rules
    rules:
      - alert: PersistentVolumeClaimLost
        expr: |
          sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Lost"}) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is lost!"
      - alert: PersistentVolumeClaimPending
        expr: |
          sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Pending"}) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending!"
      - alert: PersistentVolumeFailed
        expr: |
          sum(kube_persistentvolume_status_phase{phase="Failed",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "Persistent volume is in Failed state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: PersistentVolumePending
        expr: |
          sum(kube_persistentvolume_status_phase{phase="Pending",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "Persistent volume is in Pending state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
3.5 process.rules
```yaml
groups:
  - name: process.rules
    rules:
      # the original alert name contained spaces and "!", which is not a
      # valid Prometheus alert name; renamed accordingly
      - alert: SparkxtaskProcessDown
        expr: |
          namedprocess_namegroup_num_procs{groupname="map[:sparkxtask]"} < 4
        for: 1m
        labels:
          severity: warning
          pod: sparkxads-process
        annotations:
          description: "Task name: sparkxtask | expected process count: 4 | current value: {{ $value }}, please handle it promptly, Robot!"
```
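The namedprocess_namegroup_num_procs metric comes from a process exporter; a groupname of the form map[:sparkxtask] is what ncabatoff/process-exporter produces when groups are named from their regex matches. A minimal sketch of the matching exporter config, where the cmdline regex is an assumption inferred from the groupname above:

```yaml
# process-exporter config -- a sketch, assuming ncabatoff/process-exporter
process_names:
  - name: "{{.Matches}}"    # yields groupname values like "map[:sparkxtask]"
    cmdline:
      - 'sparkxtask'        # regex matched against the full command line
```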
4. Summary
This lesson explored how to define Prometheus rules across different dimensions. To summarize:
- Prometheus rules are a mechanism for generating alerts and recordings from PromQL expressions; by computing and aggregating metrics they can also produce new time series.
- By defining rules across different dimensions, Prometheus can monitor and alert on metrics at different levels of granularity, giving a clearer picture of application state and performance.
- To build a simple yet effective alerting strategy, choose carefully which metrics should trigger alerts, and avoid over-alerting and noise, thereby improving the reliability and accuracy of monitoring and alerting.