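Prometheus Alerting Rules

Prometheus alerting rule concepts:

Alerting rules let you define alert conditions based on Prometheus expression language (PromQL) expressions and send notifications about firing alerts to an external service. Whenever the alert expression yields one or more vector elements at a given point in time, the alert counts as active for the label sets of those elements.
Like recording rules, alerting rules are defined in separate files, which Prometheus loads through the rule_files section of its configuration, as follows:

rule_files:
  - alerting_rules/*.yml          # path to the alerting rule files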

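Rule file fields:

Field	Meaning
- groups	Top-level key; holds the list of rule groups
- name	Name of the rule group
- rules	List of rules in the group
- alert	Name of the alerting rule
- expr	PromQL expression that defines the trigger condition; it is evaluated to check whether any time series satisfies it
- for	Evaluation wait time. While the condition holds but the wait has not yet elapsed, the alert is pending; once the condition has held for the full duration, the alert is firing; when the condition clears, the alert returns to inactive
- labels	Custom labels attached to the alert
- annotations	A set of additional information, such as text describing the alert in detail. Annotations are sent to Alertmanager together with the alert when it fires. summary gives an overview of the alert and description gives the details; the Alertmanager UI also displays the alert using these two annotation values.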
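As a minimal sketch of how these fields nest, the fragment below places one rule inside a group. The file name and the group name `prometheus-self-monitoring` are hypothetical choices; any file matching the `alerting_rules/*.yml` pattern configured in `rule_files` would be loaded.

```yaml
# alerting_rules/prometheus.yml  (file name is an assumption)
groups:
  - name: prometheus-self-monitoring    # hypothetical group name
    rules:
      - alert: PrometheusTargetMissing
        expr: up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus target missing (instance {{ $labels.instance }})
          description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

Each rule template in this document is one entry of such a `rules:` list. A file assembled this way can be validated with `promtool check rules <file>` before Prometheus reloads it.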

1.1 Prometheus self-monitoring (25 rules)

1.1.1 Prometheus job missing template:

  - alert: PrometheusJobMissing               # alert rule name
    expr: absent(up{job="prometheus"})        # PromQL expression that triggers the alert
    for: 0m                                   # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                                   # labels attached to the alert
      severity: warning                       # alert severity
    annotations:                              # extra information sent with the notification; templates can reference labels
      summary: Prometheus job missing (instance {{ $labels.instance }})     # custom summary
      description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"     # custom detailed description

1.1.2 Prometheus target missing template:

  - alert: PrometheusTargetMissing         # alert rule name
    expr: up == 0                          # trigger expression (PromQL)
    for: 0m                                # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                                # labels attached to the alert
      severity: critical                   # alert severity
    annotations:                           # extra information sent with the notification; templates can reference labels
      summary: Prometheus target missing (instance {{ $labels.instance }})  # custom summary
      description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}" # custom detailed description
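With `for: 0m` the alert fires on the first evaluation in which `up == 0`, so a single failed scrape is enough to trigger a notification. A hedged variant, assuming a one-minute grace period is acceptable in your environment, keeps the alert in the pending state first; the `1m` value and the reworded description are illustrative choices, not part of the original template.

```yaml
  - alert: PrometheusTargetMissing
    expr: up == 0
    for: 1m        # assumed grace period: pending for 1m, then firing
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing (instance {{ $labels.instance }})
      description: "A Prometheus target has been down for more than 1 minute.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

If the target recovers before the minute elapses, the alert goes back to inactive and nothing is sent.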

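1.1.3 Prometheus all targets missing template:

  - alert: PrometheusAllTargetsMissing  # alert rule name
    expr: sum by (job) (up) == 0        # trigger expression (PromQL); the sum is 0 only when every target of the job is down
    for: 0m                             # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                             # labels attached to the alert
      severity: critical                # alert severity
    annotations:                        # extra information sent with the notification; templates can reference labels
      summary: Prometheus all targets missing (instance {{ $labels.instance }})  # custom summary
      description: "A Prometheus job does not have living target anymore.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  # custom detailed description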

1.1.4 Prometheus configuration reload failure template:

  - alert: PrometheusConfigurationReloadFailure           # alert rule name
    expr: prometheus_config_last_reload_successful != 1   # trigger expression (PromQL)
    for: 0m                                               # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                                               # labels attached to the alert
      severity: warning                                   # alert severity
    annotations:                                          # extra information sent with the notification; templates can reference labels
      summary: Prometheus configuration reload failure (instance {{ $labels.instance }})  # custom summary
      description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"  # custom detailed description

1.1.5 Prometheus too many restarts template:

  - alert: PrometheusTooManyRestarts            # alert rule name
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2   # trigger expression (PromQL)
    for: 0m                                     # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                                     # labels attached to the alert
      severity: warning                         # alert severity
    annotations:                                # extra information sent with the notification; templates can reference labels
      summary: Prometheus too many restarts (instance {{ $labels.instance }})   # custom summary
      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"   # custom detailed description

1.1.6 Prometheus AlertManager configuration reload failure template:

  - alert: PrometheusAlertmanagerConfigurationReloadFailure   # alert rule name
    expr: alertmanager_config_last_reload_successful != 1     # trigger expression (PromQL)
    for: 0m                 # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                 # labels attached to the alert
      severity: warning     # alert severity
    annotations:            # extra information sent with the notification; templates can reference labels
      summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})   # custom summary
      description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"   # custom detailed description

1.1.7 Prometheus AlertManager config not synced template:

  - alert: PrometheusAlertmanagerConfigNotSynced   # alert rule name
    expr: count(count_values("config_hash", alertmanager_config_hash)) > 1   # trigger expression (PromQL)
    for: 0m                 # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                 # labels attached to the alert
      severity: warning     # alert severity
    annotations:            # extra information sent with the notification; templates can reference labels
      summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})   # custom summary
      description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"   # custom detailed description

1.1.8 Prometheus AlertManager E2E dead man switch template:

  - alert: PrometheusAlertmanagerE2eDeadManSwitch   # alert rule name
    expr: vector(1)         # trigger expression (PromQL); always returns a value, so the alert always fires
    for: 0m                 # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                 # labels attached to the alert
      severity: critical    # alert severity
    annotations:            # extra information sent with the notification; templates can reference labels
      summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})   # custom summary
      description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"   # custom detailed description
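Because this alert always fires, it only proves something if a downstream system notices when it stops arriving. The following is a minimal sketch of the Alertmanager side, not a definitive setup: the receiver names `default` and `deadman-heartbeat` and the endpoint `https://heartbeat.example.com/ping` are hypothetical, and the one-minute intervals are assumed values.

```yaml
# alertmanager.yml (fragment); receiver names and URL are assumptions
route:
  receiver: default
  routes:
    - matchers:
        - alertname = PrometheusAlertmanagerE2eDeadManSwitch
      receiver: deadman-heartbeat
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 1m      # re-send regularly so that silence is detectable
receivers:
  - name: default
  - name: deadman-heartbeat
    webhook_configs:
      - url: https://heartbeat.example.com/ping   # hypothetical heartbeat endpoint
        send_resolved: false
```

The external service is expected to raise its own alarm when the pings stop, which would indicate that Prometheus, Alertmanager, or the path between them is broken.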

1.1.9 Prometheus not connected to alertmanager template:

  - alert: PrometheusNotConnectedToAlertmanager   # alert rule name
    expr: prometheus_notifications_alertmanagers_discovered < 1   # trigger expression (PromQL)
    for: 0m                 # how long the condition must hold before the alert fires; 0m fires immediately
    labels:                 # labels attached to the alert
      severity: critical    # alert severity
    annotations:            # extra information sent with the notification; templates can reference labels
      summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})   # custom summary
      description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"   # custom detailed description

1.1.10 Prometheus template text expansion failures template:

  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.11 Prometheus rule evaluation slow template:

  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.12 Prometheus notifications backlog template:

  - alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus notifications backlog (instance {{ $labels.instance }})
      description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.13 Prometheus AlertManager notification failing template:

  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
      description: "Alertmanager is failing sending notifications\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.14 Prometheus target empty template:

  - alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target empty (instance {{ $labels.instance }})
      description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.15 Prometheus target scraping slow template:

  - alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scraping slow (instance {{ $labels.instance }})
      description: "Prometheus is scraping exporters slowly\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.16 Prometheus large scrape template:

  - alert: PrometheusLargeScrape
    expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus large scrape (instance {{ $labels.instance }})
      description: "Prometheus has many scrapes that exceed the sample limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.17 Prometheus target scrape duplicate template:

  - alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.18 Prometheus TSDB checkpoint creation failures template:

  - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.19 Prometheus TSDB checkpoint deletion failures template:

  - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.20 Prometheus TSDB compactions failed template:

  - alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.21 Prometheus TSDB head truncations failed template:

  - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.22 Prometheus TSDB reload failures template:

  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.23 Prometheus TSDB WAL corruptions template:

  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.1.24 Prometheus TSDB WAL truncations failed template:

  - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2 Host and hardware: node-exporter (33 rules)

1.2.1 Host out of memory template:

Node memory is filling up (< 10% left)
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.2 Host memory under memory pressure template:

The node is under heavy memory pressure, with a high rate of major page faults
  - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host memory under memory pressure (instance {{ $labels.instance }})
      description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.3 Host unusual network throughput in template:

Host network interfaces are probably receiving too much data (> 100 MB/s)
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.4 Host unusual network throughput out template:

Host network interfaces are probably sending too much data (> 100 MB/s)
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out (instance {{ $labels.instance }})
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.5 Host unusual disk read rate template:

The disk is probably reading too much data (> 50 MB/s)
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate (instance {{ $labels.instance }})
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.6 Host unusual disk write rate template:

The disk is probably writing too much data (> 50 MB/s)
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate (instance {{ $labels.instance }})
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.7 Host out of disk space template:

The disk is almost full (< 10% left)
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.8 Host disk will fill in 24 hours template:

At the current write rate, the filesystem is predicted to run out of space within the next 24 hours
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.9 Host out of inodes template:

The disk is almost running out of available inodes (< 10% left)
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes (instance {{ $labels.instance }})
      description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.10 Host inodes will fill in 24 hours template:

At the current write rate, the filesystem is predicted to run out of inodes within the next 24 hours
  - alert: HostInodesWillFillIn24Hours
    expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{mountpoint="/rootfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.11 Host unusual disk read latency template:

Disk latency is growing (read operations > 100 ms)
  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.12 Host unusual disk write latency template:

Disk latency is growing (write operations > 100 ms)
  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.13 Host high CPU load template:

CPU load is > 80%
  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.14 Host CPU steal template:

CPU steal is > 10%
  - alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.15 Host context switching template:

Context switching is growing on the node (> 1000 / s)
  - alert: HostContextSwitching
    expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host context switching (instance {{ $labels.instance }})
      description: "Context switching is growing on node (> 1000 / s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.16 Host swap is filling up template:

Swap is filling up (> 80%)
  - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host swap is filling up (instance {{ $labels.instance }})
      description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.17 Host systemd service crashed template:

A systemd service has crashed
  - alert: HostSystemdServiceCrashed
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host systemd service crashed (instance {{ $labels.instance }})
      description: "systemd service crashed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.18 Host physical component too hot template:

A physical hardware component is too hot
  - alert: HostPhysicalComponentTooHot
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host physical component too hot (instance {{ $labels.instance }})
      description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.19 Host node overtemperature alarm template:

The physical node temperature alarm has been triggered
  - alert: HostNodeOvertemperatureAlarm
    expr: node_hwmon_temp_crit_alarm_celsius == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host node overtemperature alarm (instance {{ $labels.instance }})
      description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

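1.2.20 Host RAID array got inactive template:

RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.
  - alert: HostRaidArrayGotInactive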
    expr: node_md_state{state="inactive"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host RAID array got inactive (instance {{ $labels.instance }})
      description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.21 Host RAID disk failure template:

At least one device in the RAID array on {{ $labels.instance }} has failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap.
  - alert: HostRaidDiskFailure
    expr: node_md_disks{state="failed"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.22 Host kernel version deviations template:

Different kernel versions are running
  - alert: HostKernelVersionDeviations
    expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: Host kernel version deviations (instance {{ $labels.instance }})
      description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.23 Host OOM kill detected template:

An OOM kill was detected
  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.24 Host EDAC correctable errors detected template:

Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.
  - alert: HostEdacCorrectableErrorsDetected
    expr: increase(node_edac_correctable_errors_total[1m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.25 Host EDAC uncorrectable errors detected template:

Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.
  - alert: HostEdacUncorrectableErrorsDetected
    expr: node_edac_uncorrectable_errors_total > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.26 Host network receive errors template:

Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Receive Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.27 Host network interface saturated template:

The network interface "{{ $labels.device }}" on "{{ $labels.instance }}" is getting overloaded.
  - alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Host Network Interface Saturated (instance {{ $labels.instance }})
      description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.28 Host network bond degraded template:

Bond "{{ $labels.device }}" is degraded on "{{ $labels.instance }}".
  - alert: HostNetworkBondDegraded
    expr: (node_bonding_active - node_bonding_slaves) != 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Bond Degraded (instance {{ $labels.instance }})
      description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.29 Host conntrack limit template:

The number of conntrack entries is approaching the limit
  - alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host conntrack limit (instance {{ $labels.instance }})
      description: "The number of conntrack is approaching limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.30 Host clock skew template:

Clock skew detected. The clock is out of sync. Ensure NTP is configured correctly on this host.
  - alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock skew (instance {{ $labels.instance }})
      description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.31 Host clock not synchronising template:

The clock is not synchronising. Ensure NTP is configured on this host.
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising (instance {{ $labels.instance }})
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.2.32 Host requires reboot template:

{{ $labels.instance }} requires a reboot.
  - alert: HostRequiresReboot
    expr: node_reboot_required > 0
    for: 4h
    labels:
      severity: info
    annotations:
      summary: Host requires reboot (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} requires a reboot.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3 Docker containers: google/cAdvisor (7 rules)

1.3.1 Container killed template:

A container has disappeared
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Container killed (instance {{ $labels.instance }})
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.2 Container absent template:

A container has been absent for 5 minutes
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Container absent (instance {{ $labels.instance }})
      description: "A container is absent for 5 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.3 Container CPU usage template:

Container CPU usage is above 80%
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container CPU usage (instance {{ $labels.instance }})
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.4 Container memory usage template:

Container memory usage is above 80%
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.5 Container volume usage template:

Container volume usage is above 80%
  - alert: ContainerVolumeUsage
    expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume usage (instance {{ $labels.instance }})
      description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.6 Container volume IO usage template:

Container volume IO usage is above 80%
  - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume IO usage (instance {{ $labels.instance }})
      description: "Container Volume IO usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.3.7 Container high throttle rate template:

The container is being throttled
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container high throttle rate (instance {{ $labels.instance }})
      description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4 Blackbox: prometheus/blackbox_exporter (8 rules)

1.4.1 Blackbox probe failed template:

The probe failed
  - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe failed (instance {{ $labels.instance }})
      description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.2 Blackbox slow probe template:

The blackbox probe took more than 1 second to complete
  - alert: BlackboxSlowProbe
    expr: avg_over_time(probe_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox slow probe (instance {{ $labels.instance }})
      description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.3 Blackbox probe HTTP failure template:

The HTTP status code is not in the range 200-399
  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.4 Blackbox SSL certificate will expire soon (30 days) template:

SSL certificate expires in less than 30 days
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.5 Blackbox SSL certificate will expire soon (3 days) template:

SSL certificate expires in less than 3 days
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in 3 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
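
As written, a certificate with fewer than 3 days left satisfies both the 30-day and the 3-day expression, so the warning and the critical alert fire together. One possible way to confine the warning to its own band, mirroring the `>= 80 and < 90` pattern used by the VMware memory warning in section 1.6.1, is sketched below. The alert name is hypothetical and the rest is copied from the 30-day template.

```yaml
  - alert: BlackboxSslCertificateExpiresWithin30Days   # hypothetical name
    # Warn only while between 3 and 30 days remain; below 3 days the
    # critical rule takes over.
    expr: >-
      probe_ssl_earliest_cert_expiry - time() < 86400 * 30
      and
      probe_ssl_earliest_cert_expiry - time() >= 86400 * 3
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in less than 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

Here `{{ $value }}` is the left-hand side of `and`, that is, the number of seconds until expiry.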

1.4.6 Blackbox SSL certificate expired template:

SSL certificate has already expired
  - alert: BlackboxSslCertificateExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
      description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.7 Blackbox probe slow HTTP template:

HTTP request took more than 1s
  - alert: BlackboxProbeSlowHttp
    expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
      description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.4.8 Blackbox probe slow ping template:

Blackbox ping took more than 1s
  - alert: BlackboxProbeSlowPing
    expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox probe slow ping (instance {{ $labels.instance }})
      description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5 Windows Server: prometheus-community/windows_exporter (5 rules)

1.5.1 Windows Server collector error template:

Collector {{ $labels.collector }} was not successful
  - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Windows Server collector Error (instance {{ $labels.instance }})
      description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.2 Windows Server service status template:

Windows service state is not OK
  - alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Windows Server service Status (instance {{ $labels.instance }})
      description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.3 Windows Server CPU usage template:

CPU usage is more than 80%
  - alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Windows Server CPU Usage (instance {{ $labels.instance }})
      description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.4 Windows Server memory usage template:

Memory usage is more than 90%
  - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Windows Server memory Usage (instance {{ $labels.instance }})
      description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.5.5 Windows Server disk space usage template:

Disk usage is more than 80%
  - alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
      description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
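
Both sides of the division above are scaled by the same `/ 1024 / 1024`, so the factor cancels out. A shorter form that should be equivalent up to floating-point rounding is sketched below; only the expression differs from the template above.

```yaml
  - alert: WindowsServerDiskSpaceUsage
    # used % = 100 - 100 * free / size; the MiB conversion is unnecessary
    expr: 100 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
      description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```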

1.6 VMware: pryorda/vmware_exporter (4 rules)

1.6.1 Virtual machine memory warning template:

High memory usage on {{ $labels.instance }}: {{ $value | printf "%.2f"}}%
  - alert: VirtualMachineMemoryWarning
    expr: vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Virtual Machine Memory Warning (instance {{ $labels.instance }})
      description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.6.2 Virtual machine memory critical template:

High memory usage on {{ $labels.instance }}: {{ $value | printf "%.2f"}}%
  - alert: VirtualMachineMemoryCritical
    expr: vmware_vm_mem_usage_average / 100 >= 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Virtual Machine Memory Critical (instance {{ $labels.instance }})
      description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.6.3 High number of snapshots template:

High number of snapshots on {{ $labels.instance }}: {{ $value }}
  - alert: HighNumberOfSnapshots
    expr: vmware_vm_snapshots > 3
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: High Number of Snapshots (instance {{ $labels.instance }})
      description: "High snapshots number on {{ $labels.instance }}: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.6.4 Outdated snapshots template:

Outdated snapshots on {{ $labels.instance }}: {{ $value | printf "%.0f"}} days
  - alert: OutdatedSnapshots
    expr: (time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Outdated Snapshots (instance {{ $labels.instance }})
      description: "Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \"%.0f\"}} days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7 Netdata: embedded exporter (9 rules)

1.7.1 Netdata high CPU usage template:

Netdata high CPU usage (> 80%)
  - alert: NetdataHighCpuUsage
    expr: rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m]) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high cpu usage (instance {{ $labels.instance }})
      description: "Netdata high CPU usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.2 Host CPU steal noisy neighbor template:

CPU steal is > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of credit.
  - alert: HostCpuStealNoisyNeighbor
    expr: rate(netdata_cpu_cpu_percentage_average{dimension="steal"}[1m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.3 Netdata high memory usage template:

Netdata high memory usage (> 80%)
  - alert: NetdataHighMemoryUsage
    expr: 100 / netdata_system_ram_MB_average * netdata_system_ram_MB_average{dimension=~"free|cached"} < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high memory usage (instance {{ $labels.instance }})
      description: "Netdata high memory usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
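
One thing to watch in the expression above: with PromQL's default one-to-one vector matching, `100 / netdata_system_ram_MB_average` keeps the `dimension` label, so each `free` or `cached` series on the right is paired with the same series on the left, and the product comes out at about 100 regardless of actual usage. A sketch of a variant that sums the free and cached dimensions and divides by the total across all dimensions follows. It assumes every series carries an `instance` label and that the dimensions of this metric add up to total RAM, neither of which the original confirms.

```yaml
  - alert: NetdataHighMemoryUsage
    # Percentage of RAM that is free or cached, per instance (assumed label).
    expr: >-
      100 * sum by (instance) (netdata_system_ram_MB_average{dimension=~"free|cached"})
      / sum by (instance) (netdata_system_ram_MB_average) < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high memory usage (instance {{ $labels.instance }})
      description: "Netdata high memory usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

The same matching concern applies to any expression of the form `100 / metric * metric{dimension=~"..."}`.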

1.7.4 Netdata low disk space template:

Netdata low disk space (> 80% used)
  - alert: NetdataLowDiskSpace
    expr: 100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~"avail|cached"} < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata low disk space (instance {{ $labels.instance }})
      description: "Netdata low disk space (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.5 Netdata predicted disk full template:

Netdata predicts the disk will be full within 24 hours
  - alert: NetdataPredictedDiskFull
    expr: predict_linear(netdata_disk_space_GB_average{dimension=~"avail|cached"}[3h], 24 * 3600) < 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata predicted disk full (instance {{ $labels.instance }})
      description: "Netdata predicted disk full in 24 hours\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.6 Netdata MD mismatch cnt unsynchronized blocks template:

RAID array has unsynchronized blocks
  - alert: NetdataMdMismatchCntUnsynchronizedBlocks
    expr: netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Netdata MD mismatch cnt unsynchronized blocks (instance {{ $labels.instance }})
      description: "RAID array has unsynchronized blocks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.7 Netdata disk reallocated sectors template:

Reallocated sectors on disk
  - alert: NetdataDiskReallocatedSectors
    expr: increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Netdata disk reallocated sectors (instance {{ $labels.instance }})
      description: "Reallocated sectors on disk\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.8 Netdata disk current pending sector template:

Disk has sectors currently pending reallocation
  - alert: NetdataDiskCurrentPendingSector
    expr: netdata_smartd_log_current_pending_sector_count_sectors_average > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata disk current pending sector (instance {{ $labels.instance }})
      description: "Disk current pending sector\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

1.7.9 Netdata reported uncorrectable disk sectors template:

Reported uncorrectable disk sectors
  - alert: NetdataReportedUncorrectableDiskSectors
    expr: increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }})
      description: "Reported uncorrectable disk sectors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

posted @ 2021-12-08 21:57 姚鑫磊
