Prometheus Setup (Container Edition)

1. Install Prometheus

docker pull prom/prometheus:latest

mkdir -p ~/dockerdata/prometheus

# Create prometheus.yml with the content shown below
vim ~/dockerdata/prometheus/prometheus.yml

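global:    # Global settings; a value set inside an individual job overrides the one here
  scrape_interval: 15s      # Scrape metrics from every monitored target once every 15 seconds.
  scrape_timeout: 10s       # A scrape that has not returned data within 10 seconds counts as failed.
  evaluation_interval: 1m    # Evaluate all rules once per minute. This only re-evaluates the rules already loaded; edited rule files take effect after a configuration reload or a restart.

alerting:         # Alerting settings; the Alertmanager instances are declared here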
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanagerIp:9093

rule_files:      # Rule files: paths of the files that hold the monitoring rules
  # - "first_rules.yml"
  # - "second_rules.yml"

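scrape_configs:       # Scrape settings: the targets to monitor and their parameters
- job_name: prometheus       # This job monitors the local Prometheus itself; the name is arbitrary and is shown in the web UI under Status > Targets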
  honor_timestamps: true
  scrape_interval: 5s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  static_configs:
  - targets:
    - localhost:9090

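# Remote-write storage settings
#remote_write: 
# Remote-read settings
#remote_read:

How the parts of this configuration fit together: Prometheus periodically scrapes metrics from the jobs under scrape_configs and evaluates them against the rules listed in rule_files; a rule whose condition holds shows up as an alert on the web page. The labels attached to the alert in the rules file (under labels) are then compared with the match labels in the Alertmanager configuration to pick a route, and the route determines how the notification is sent.

Start Prometheus

# Start the container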
docker run -d --name prometheus -p 9090:9090 \
-v ~/dockerdata/prometheus:/etc/prometheus \
-v ~/dockerdata/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus:latest

docker logs prometheus

# Open the native web UI
http://127.0.0.1:9090
# Check target status
http://127.0.0.1:9090/targets
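
The same target status is also available from the Prometheus HTTP API at /api/v1/targets. The following is a minimal sketch, not part of the original setup, that summarizes such a response per job. The helper names summarize_targets and fetch_targets and the SAMPLE payload are illustrative assumptions; SAMPLE is hand-written in the shape of the API response so that the sketch runs without a live server.

```python
import json
from urllib.request import urlopen


def summarize_targets(payload: dict) -> dict:
    """Map each job name to the list of (instance, health) pairs it contains."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        summary.setdefault(labels["job"], []).append(
            (labels["instance"], target["health"])
        )
    return summary


def fetch_targets(base_url: str = "http://127.0.0.1:9090") -> dict:
    """Fetch the live target list; needs a running Prometheus at base_url."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


# Hand-written sample (an assumption, not captured output) so the sketch runs offline.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "health": "up",
            },
        ],
        "droppedTargets": [],
    },
}

print(summarize_targets(SAMPLE))
```

Against a running container, summarize_targets(fetch_targets()) should list every job added in the later sections together with its health.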

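2. Install Grafana

# Pull the image
docker pull grafana/grafana

# Start the container
docker run -d -p 3000:3000 --name=grafana grafana/grafana

# Open the UI; default account: admin/admin
http://127.0.0.1:3000

Then add Prometheus as a data source in Grafana, using the Prometheus address as the URL. Because Grafana runs in its own container, use the host IP (for example http://ip:9090) rather than 127.0.0.1.

3. Monitoring with node-exporter

3.1 Install node-exporter

Server monitoring: node-exporter collects hardware resource usage for a server. Run it on every machine to be monitored.

# Pull the image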
docker pull prom/node-exporter

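# Start the container
# The --path.* flags point node-exporter at the host directories mounted below; without them the mounts are not used
docker run -d --name node-exporter -p 19100:9100 \
-v "/proc:/host/proc:ro" \
-v "/sys:/host/sys:ro" \
-v "/:/rootfs:ro" \
prom/node-exporter \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--path.rootfs=/rootfs

# Metrics endpoint: shows the host's metrics
http://ip:19100/metrics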

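Add this job to prometheus.yml

# Edit the Prometheus configuration
vim ~/dockerdata/prometheus/prometheus.yml

scrape_configs:      # append the job to the existing scrape_configs list; do not repeat the key
  - job_name: 'node-exporter'
    static_configs:
    - targets: ['ip:19100']

Restart the Prometheus container:

# Restart
docker restart prometheus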

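3.2 Import a template and create the dashboard

Template: host server (ECS) metrics

Template ID: 8919

Template URL: https://grafana.com/grafana/dashboards/8919

Description: dashboard 8919 monitors host server (ECS) metrics, with an overall resource overview plus detailed charts for CPU, memory, disk, I/O, network, and other metrics.

4. Monitoring with cAdvisor

4.1 Install cAdvisor with Docker

cAdvisor is a container resource collector developed by Google that exposes metrics for Docker containers. Run it on every container host to be monitored. Note that the google/cadvisor image on Docker Hub is no longer updated; newer releases are published as gcr.io/cadvisor/cadvisor.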

docker pull google/cadvisor

docker run -d --name cadvisor -p 18081:8080 \
-v /:/rootfs:ro \
-v /var/run:/var/run:ro \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
-v /dev/disk/:/dev/disk:ro \
google/cadvisor:latest

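# Metrics endpoint: shows the Docker containers' metrics
http://ip:18081/metrics

Add this job to prometheus.yml

# Edit the Prometheus configuration
vim ~/dockerdata/prometheus/prometheus.yml

scrape_configs:      # append the job to the existing scrape_configs list; do not repeat the key
  - job_name: 'cadvisor'
    static_configs:
    - targets: ['ip:18081']

Restart the Prometheus container:

# Restart
docker restart prometheus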

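4.2 Import a template and create the dashboard

Template: Docker container metrics

Template ID: 893

Template URL: https://grafana.com/grafana/dashboards/893

Description: dashboard 893 monitors Docker container metrics.

5. Spring Boot and JVM monitoring

5.1 Add the Spring Boot dependencies and configuration:

        <!-- Monitoring -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <!-- Prometheus adapter -->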
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
            <scope>runtime</scope>
        </dependency>

application.yaml configuration:

spring:
  application:
    name: springboot-prometheus-example
    
# Enable the monitoring endpoints so that Prometheus can scrape them
management:
  endpoint:
    health:
      show-details: always
    metrics:
      enabled: true
    prometheus:
      enabled: true
  endpoints:
    web:
      exposure:
        include: "*"
  metrics:
    tags:
      application: ${spring.application.name}
    export:
      prometheus:
        enabled: true

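# 1. Open http://127.0.0.1:8080/actuator to see the list of exposed endpoints, including the prometheus endpoint that Prometheus scrapes

# Metrics endpoint: shows the Spring Boot application's metrics
http://127.0.0.1:8080/actuator/prometheus

5.2 Add the job to prometheus.yml

# Edit the Prometheus configuration
vim ~/dockerdata/prometheus/prometheus.yml

scrape_configs:      # append the job to the existing scrape_configs list; do not repeat the key
  - job_name: 'springboot-prometheus-example' 
    metrics_path: '/actuator/prometheus'     # Path to scrape
    static_configs:
    - targets: ['192.168.14.238:8080']     # Use the IP and port of your own Spring Boot application

Restart the Prometheus container:

docker restart prometheus

5.3 Import templates and create the dashboards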

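Template: Spring Boot application monitoring

Template IDs: 4701, 12900

Template URL: https://grafana.com/grafana/dashboards/4701

Template URL: https://grafana.com/grafana/dashboards/12900

Description: 4701 is a JVM dashboard and 12900 is a Spring Boot dashboard.

Configure the Grafana dashboards:

Import both templates: 4701 for JVM monitoring and 12900 for Spring Boot monitoring.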

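6. Other monitoring configuration

Network probing with Blackbox Exporter

So far we have monitored host resource usage, container state, and the runtime data of databases and middleware. These make up the infrastructure behind the business services. White-box monitoring shows their actual internal state, and watching the metrics lets us anticipate likely problems and tune out potential sources of instability. A complete monitoring setup, however, should complement the bulk of white-box monitoring with a suitable amount of black-box monitoring.
Black-box monitoring tests the externally visible behavior of a service from the user's point of view. Common forms are HTTP probes and TCP probes, which check whether a site or service is reachable and how quickly it responds.

The biggest difference between the two is that black-box monitoring is failure-oriented: when a failure occurs, it detects it quickly, whereas white-box monitoring focuses on proactively finding or predicting potential problems. A sound monitoring setup should find potential problems from the white-box side and quickly detect problems that have already happened from the black-box side.

Blackbox Exporter is the official black-box monitoring solution from the Prometheus community. It probes the network over HTTP, HTTPS, DNS, TCP, and ICMP. It can be built from source with Go; the steps below download a prebuilt release instead:

# Download and install
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.16.0/blackbox_exporter-0.16.0.linux-amd64.tar.gz
mkdir -p /usr/local/prometheus
tar xvf blackbox_exporter-0.16.0.linux-amd64.tar.gz -C /usr/local/prometheus/
cd /usr/local/prometheus
mv blackbox_exporter-0.16.0.linux-amd64/ blackbox_exporter
useradd prometheus
chown -R prometheus:prometheus /usr/local/prometheus/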

vim /usr/lib/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/blackbox_exporter/blackbox_exporter --config.file=/usr/local/prometheus/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl enable blackbox_exporter.service
systemctl start blackbox_exporter.service

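Blackbox Exporter needs probe configuration at startup: custom HTTP headers, the TLS settings a probe requires, or the checks the probe itself performs. Each probe configuration is called a module and is supplied to Blackbox Exporter in a YAML file. A module consists of the probe type (prober), the probe timeout (timeout), and the settings specific to that probe type:

# Probe type: http, tcp, dns, icmp.
prober: <prober_string>
# Timeout
[ timeout: <duration> ]
# Probe-specific settings; configure at most one of these
[ http: <http_probe> ]
[ tcp: <tcp_probe> ]
[ dns: <dns_probe> ]
[ icmp: <icmp_probe> ]

Below is a simplified probe configuration file, blackbox.yml, with two HTTP probe modules: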

modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST

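Start a Blackbox Exporter instance with the following command, pointing it at the probe configuration file:

blackbox_exporter --config.file=/etc/prometheus/blackbox.yml
or
systemctl restart blackbox_exporter.service

Once it is running, opening http://172.19.0.27:9115/probe?module=http_2xx&target=baidu.com probes baidu.com. The module parameter in the URL selects the probe module, the target parameter names the target, and the probe result comes back as metrics: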

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004359875
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.046153996
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 81
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.00105657
probe_http_duration_seconds{phase="processing"} 0.039457402
probe_http_duration_seconds{phase="resolve"} 0.004359875
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.000337184
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.26330408e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 81
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

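From these samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics about access quality, which helps administrators find failures and problems proactively.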
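
As a rough illustration of reading these samples programmatically, the sketch below builds the probe URL from its two parameters and parses a few lines of the output. The function names probe_url and parse_samples are assumptions made for this example, and the parser is deliberately simplified: it expects one "series value" pair per line and does not handle samples that carry a timestamp.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, module: str, target: str) -> str:
    """Build the /probe URL from the exporter address, module name, and target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


def parse_samples(text: str) -> dict:
    """Parse metric lines into {series: value}; blank and comment lines are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)  # simplified: no timestamp column
        samples[series] = float(value)
    return samples


# A few lines copied from the probe output shown earlier in this section.
OUTPUT = """\
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.046153996
probe_http_duration_seconds{phase="connect"} 0.00105657
probe_http_status_code 200
probe_success 1
"""

samples = parse_samples(OUTPUT)
healthy = samples["probe_success"] == 1 and samples["probe_http_status_code"] == 200

print(probe_url("172.19.0.27:9115", "http_2xx", "baidu.com"))
print(healthy, samples["probe_duration_seconds"])
```

In practice Prometheus does this scraping itself; the sketch only shows what the module and target parameters and the returned samples look like to a client.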

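Reference article: "03 . Prometheus监控容器和HTTP探针应用及服务发现" (Prometheus container monitoring, HTTP probes, and service discovery), 常见-youmen, 博客园 (cnblogs)

Process monitoring and alerting with Process-Exporter

$ wget https://github.com/ncabatoff/process-exporter/releases/download/v0.5.0/process-exporter-0.5.0.linux-amd64.tar.gz
$ tar -xvf  process-exporter-0.5.0.linux-amd64.tar.gz

Change into the extracted directory and define the processes to monitor. Process-Exporter works from the configured process names: it searches for matching processes and collects their metrics, much like running "ps -efl | grep xxx" by hand. The configuration file does not exist initially and has to be created; its name is arbitrary: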

$ vim process-name.yaml
process_names:
  - name: "{{.Matches}}"
    cmdline:
    - 'alertToRobot.js'

  - name: "{{.Matches}}"
    cmdline:
    - 'prometheus'

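The configuration file (process-name.yaml) adds two processes to monitor, "alertToRobot.js" and "prometheus". Each entry under process_names defines one group of processes. The {{.Matches}} template expands to the map of all matches produced by applying the cmdline regular expressions. The other available templates are:

Template variables: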

{{.Comm}} contains the basename of the original executable, i.e. 2nd field in /proc/<pid>/stat
{{.ExeBase}} contains the basename of the executable
{{.ExeFull}} contains the fully qualified path of the executable
{{.Username}} contains the username of the effective user
{{.Matches}} map contains all the matches resulting from applying cmdline regexps
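
To make the cmdline matching concrete, here is a simplified sketch of how a group could select processes: every regular expression in the group's cmdline list has to match the process command line. This is an illustration only, not process-exporter's actual code. The process table, the group names "robot" and "prometheus", and the helper matches_group are all hypothetical, and joining argv with spaces is a simplification.

```python
import re


def matches_group(argv: list, cmdline_patterns: list) -> bool:
    """True when every pattern is found somewhere in the joined command line."""
    joined = " ".join(argv)
    return all(re.search(pattern, joined) for pattern in cmdline_patterns)


# Hypothetical process table: pid -> argv.
PROCESSES = {
    101: ["node", "/opt/robot/alertToRobot.js"],
    102: ["/bin/prometheus", "--config.file=/etc/prometheus/prometheus.yml"],
    103: ["sshd", "-D"],
}

# Hypothetical group names mapped to the cmdline patterns from process-name.yaml.
GROUPS = {
    "robot": ["alertToRobot.js"],
    "prometheus": ["prometheus"],
}

selected = {
    name: [pid for pid, argv in PROCESSES.items() if matches_group(argv, patterns)]
    for name, patterns in GROUPS.items()
}
print(selected)
```

Because the patterns are regular expressions, the dot in 'alertToRobot.js' matches any character; escape it as 'alertToRobot\.js' if an exact match matters.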

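With the configuration in place, run process-exporter against it:

$ nohup ./process-exporter -config.path process-name.yaml & # Run in the background; output goes to nohup.out

$ curl http://localhost:9256/metrics   # Fetch the metrics

The output shows that process-exporter reports, for each configured group, the number of matching processes in each state. Here both monitored processes have a non-zero value only for state="Sleeping", meaning they are asleep and waiting to be woken, which is normal. The Linux process states are listed at the end of this section.

Next, configure Prometheus to scrape this data. As with the other exporters, this means adding another job: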

$ vim prometheus.yml
……
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus_server'
    static_configs:
    - targets: ['localhost:9100']

  - job_name: 'process'
    static_configs:
    - targets: ['localhost:9256']
……

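My Prometheus configuration file is named prometheus.yml. The job named prometheus_server is the node-exporter source configured earlier, and "process" is the new one. After restarting Prometheus the data is visible in its web UI: enter "namedprocess_namegroup_states" in the expression box to see the same state values as above, filter by label to look at a single process, or add the metric to Grafana. These steps are much the same as before, so they are not repeated here.

As an aside, how do you check a process's state from the command line? Run ps -efl | grep xxx, or ps -l as in this example: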

[root@localhost test6]# ps -l
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0 17398 17394  0  75   0 - 16543 wait   pts/0    00:00:00 bash
4 R     0 17469 17398  0  77   0 - 15877 -      pts/0    00:00:00 ps
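The second column (S) is the process state:

D  uninterruptible sleep (usually IO)
R  running or runnable (on run queue)
S  interruptible sleep (waiting for an event)
T  stopped or traced
Z  defunct ("zombie") process

Reference article: "Prometheus监控进程状态（Process-Exporter）" (Monitoring process state with Prometheus), 腾讯云开发者社区 (Tencent Cloud developer community)

7. Integrate Alertmanager for alert notifications

7.1 Install Alertmanager

# Pull the image
docker pull prom/alertmanager

mkdir -p ~/dockerdata/promethrus/alertmanager

# Create alertmanager.yml with the content shown below
vim ~/dockerdata/promethrus/alertmanager/alertmanager.yml

alertmanager.yml configuration:

For the full set of options, see the Alertmanager configuration documentation.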

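# global: global settings, mainly the notification channels such as email and webhook.
global:
  # Resolve timeout, 5 minutes (the default). If an alert carries no end time and Alertmanager stops receiving updates for it, the alert is marked resolved after this long. Alerts sent by Prometheus always carry an end time, so this mainly affects other senders.
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'    # QQ Mail SMTP server: smtp.qq.com on port 465 or 587. The POP3/SMTP service must also be enabled in the mailbox settings.
  smtp_from: '1272621269@qq.com'
  smtp_auth_username: '1272621269@qq.com'
  # This is the authorization code for third-party clients, not the QQ account password; using the password fails. The code is shown when you enable the POP3/SMTP service in the QQ Mail settings.
  smtp_auth_password: 'lojdeopbholobgah'
  # Whether to require TLS; this depends on the environment. If you see "email.loginAuth failed: 530 Must issue a STARTTLS command first", set it to true.
  # If TLS is on and you see "starttls failed: x509: certificate signed by unknown authority", set tls_config.insecure_skip_verify: true under email_configs to skip certificate verification.
  smtp_require_tls: false
templates:    # Notification templates
  - '/usr/local/alertmanager/alert.tmp'

# route: the dispatch policy for alerts. Every alert from Prometheus first enters the root route. The root route must not contain any matchers, because it is the entry point for all alerts.
# The root route also needs a receiver, which acts as the default: it handles alerts that match no child route (with no child routes configured, the root route sends everything).
# After entering the root route, an alert is checked against the child routes in order. On the first match it is sent to that child route's receiver and matching stops, because
# continue defaults to false; with continue: true the alert goes on to be checked against the following sibling routes. An alert that matches no child route at some level
# is handled by the last route it did match, or by the root route. To visualize the routing tree, open https://www.prometheus.io/webtools/alerting/routing-tree-editor/,
# paste the contents of alertmanager.yml into the text box, and click "Draw Routing Tree".
route:
  # Grouping: alerts that share the same values for these labels (here alertname) are collected into one group and sent as a single notification. To disable grouping entirely, use group_by: ['...']
  group_by: ['alertname']
  # When a new group is created, wait 30 seconds before sending its first notification, so that more alerts with the same group labels can be collected and sent together.
  group_wait: 30s
  # After the first notification for a group, wait 2 minutes before sending a notification about new alerts that were added to that group.
  group_interval: 2m 
  repeat_interval: 10m    # If a notification was sent successfully and the problem persists, repeat it every 10 minutes.
  receiver: 'email'        # Default receiver; must match a name defined under receivers below. Common channels include email, wechat, slack, and webhook.
  routes:    # Child routes
  - receiver: 'wechat'
    match:    # Labels an alert must carry to take this route; match_re does the same with regular expressions
      severity: Disaster    # Alerts with severity=Disaster are sent through wechat

receivers:    # Receiver definitions.
- name: 'email'    # Receiver name
  email_configs:
  - to: '{{ template "email.to"}}'  # Recipient address (taken from the value defined in the template file)
    html: '{{ template "email.to.html" .}}' # Email body (rendered from the template file)
#    headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }    # Email subject; leave unset to use the default
    send_resolved: true        # Also notify when the alert is resolved

- name: 'wechat'
  wechat_configs:
  - corp_id: wwd76d598b5fad5097        # Company ID ("My Company" ---> "CorpID", at the bottom of the page)
    to_user: '@all'        # WeCom user IDs to notify; '@all' means everyone
#   to_party: ''    # Department ID to notify
    agent_id: 1000004    # WeCom ("Apps" --> custom app [Prometheus] --> "AgentId") 
    api_secret: DY9IlG0Bdwawb_ku0NblxKFrrmMwbLIZ7YxMa5rCg8g        # WeCom ("Apps" --> custom app [Prometheus] --> "Secret") 
    message: '{{ template "email.to.html" .}}'    # Message body (rendered from the template)
    send_resolved: true         # Also notify when the alert is resolved

inhibit_rules:        # Inhibition rules: while an alert matching the source side is firing, notifications for alerts matching the target side are suppressed.
  - source_match:    # Source condition: alerts with severity 'critical'.
      severity: 'critical'
    target_match:    # Target condition: alerts with severity 'warning'.
      severity: 'warning'
    # Inhibition applies only when source and target have the same values for the alertname, dev (environment), and instance labels.
    # When source_match and target_match both hold and those three labels are equal, the target alert is inhibited.
    equal: ['alertname', 'dev', 'instance']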
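
The routing and inhibition behavior described in the comments can be sketched in a few lines. This is a simplified model written for illustration, not Alertmanager's implementation: it supports only exact-match labels, assumes every child route names its own receiver, and treats a missing label as an empty string when comparing the equal labels. The alert names, instances, and function names are hypothetical.

```python
def matches(labels: dict, match: dict) -> bool:
    """True when the alert carries every label/value pair in `match`."""
    return all(labels.get(name) == value for name, value in match.items())


def pick_receivers(labels: dict, route: dict) -> list:
    """Receivers for an alert: matching child routes first, else this route's own."""
    receivers = []
    for child in route.get("routes", []):
        if matches(labels, child.get("match", {})):
            receivers.extend(pick_receivers(labels, child))
            if not child.get("continue", False):
                break  # continue defaults to false: stop at the first match
    return receivers or [route["receiver"]]


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """True when a firing source alert agrees with the target on all `equal` labels."""
    if not matches(target, rule["target_match"]):
        return False
    return any(
        source is not target
        and matches(source, rule["source_match"])
        and all(source.get(n, "") == target.get(n, "") for n in rule["equal"])
        for source in firing
    )


# The route tree and inhibit rule from alertmanager.yml, reduced to plain dicts.
ROUTE = {
    "receiver": "email",
    "routes": [{"receiver": "wechat", "match": {"severity": "Disaster"}}],
}
INHIBIT = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

# Hypothetical alerts.
down = {"alertname": "InstanceDown", "instance": "10.0.0.5:19100", "severity": "Disaster"}
mem_warning = {"alertname": "HighMemoryUsage", "instance": "10.0.0.5:19100", "severity": "warning"}
mem_critical = {**mem_warning, "severity": "critical"}
other_warning = {**mem_warning, "instance": "10.0.0.6:19100"}
firing = [mem_critical, mem_warning, other_warning]

print(pick_receivers(down, ROUTE), pick_receivers(mem_warning, ROUTE))
print(is_inhibited(mem_warning, firing, INHIBIT), is_inhibited(other_warning, firing, INHIBIT))
```

The Disaster alert goes to wechat and everything else falls back to email. The warning on 10.0.0.5 is suppressed by the critical alert with the same alertname and instance, while the warning on 10.0.0.6 is not.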

Example content of the alert.tmp template

[root@ds-slave ~]# vim /usr/local/alertmanager/alert.tmp
{{ define "email.from" }}916719080@qq.com{{ end }}
{{ define "email.to" }}916719080@qq.com{{ end }}
{{ define "email.to.html" }}
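{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
<h2>@Alert notification</h2>
Source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
<h2>@Alert resolved</h2>
Source: prometheus_alert <br>
Host: {{ .Labels.instance }}<br>
Summary: {{ .Annotations.summary }}<br>
Details: {{ .Annotations.description }}<br>
Triggered at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}<br>
Resolved at: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}<br>
{{ end }}{{ end -}}
{{- end }}

# The <h2> and <br> tags give the email its headings and line breaks; WeCom notifications wrap lines on their own and do not need them.

# {{- if gt (len .Alerts.Firing) 0 -}} is Go template syntax: the block that follows runs only when .Alerts.Firing is non-empty.

# {{ range .Alerts.Firing }} iterates over the firing alerts, taking one element at a time and rendering the enclosed block for it; {{ range .Alerts.Resolved }} does the same for resolved alerts. Ranging over .Alerts in both sections would list every alert twice whenever a group contains firing and resolved alerts together.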

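Start Alertmanager

# Start the container
docker run -d --name alertmanager -p 19093:9093 \
-v ~/dockerdata/promethrus/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager
# If the container clock is wrong, sync it with the host: -v /etc/localtime:/etc/localtime
# alertmanager.yml references /usr/local/alertmanager/alert.tmp, so that file must exist at the same path inside the container, for example via -v /usr/local/alertmanager/alert.tmp:/usr/local/alertmanager/alert.tmp
# Open the Alertmanager web UI
http://ip:19093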

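7.2 Add the Alertmanager configuration and job to prometheus.yml

# Edit the Prometheus configuration
vim ~/dockerdata/prometheus/prometheus.yml

# Point Prometheus at the Alertmanager service
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.159.148:19093']

# Monitor the Alertmanager service itself
scrape_configs:      # append the job to the existing scrape_configs list; do not repeat the key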
  - job_name: "alertmanager"
    static_configs:
      - targets: ["192.168.159.148:19093"]

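7.3 Configure the alert rule file location in prometheus.yml

Prometheus evaluates the loaded rules every evaluation_interval (set under global). Edits to the rule files are not picked up automatically: reload the configuration (for example by sending SIGHUP with docker kill --signal=HUP prometheus) or restart the container.

# Edit the Prometheus configuration
vim ~/dockerdata/prometheus/prometheus.yml

# Global settings
global:
  scrape_interval: 15s
  evaluation_interval: 15s # default is 1m
# Alert rule files
rule_files:
  - "rules/*.yml"

Create the alert rule file

# Create and edit the alert rule file
mkdir -p ~/dockerdata/prometheus/rules
vim ~/dockerdata/prometheus/rules/rules-alerts.yml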

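groups:
- name: Instance liveness alert rules
  rules:
  - alert: InstanceDown     # Name of the alert rule (alertname)
    expr: up == 0         # The expression to evaluate. The up metric reports the state of every scraped target, so up == 0 means a target cannot be scraped
    for: 30s    # The condition must hold continuously for this long before the alert moves from pending to firing and is sent to Alertmanager; shorter blips are not sent. Do not set for below scrape_interval: with scrape_interval 30s and for 15s, a triggered rule is certain to fire, because no new sample arrives within those 15s and the old value is evaluated again. The wait applies only when the condition first becomes true.
    labels:        # Extra labels attached to the alert.
      severity: Disaster   # Alert level. The values are free-form labels; this setup uses warning, critical, and emergency in increasing severity, with Disaster as the highest.
    annotations:        # Annotations: a second set of labels that are not part of the alert's identity; they carry extra information such as the text shown in notifications.
      summary: "Node unreachable"
      description: "The node has been unreachable for more than 30 seconds!"

- name: Memory alert rules
  rules:
  - alert: "HighMemoryUsage"
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 75    # Fires when memory usage is above 75%
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server memory alert"
      description: "Memory usage is above 75%! (current value: {{ $value }}%)"

- name: Disk alert rules
  rules:
  - alert: HighDiskUsage
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80    # Fires when any mount point is more than 80% full
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Server disk alert"
      description: "Disk usage is above 80%! (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"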
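View the configured alert rules in the Prometheus UI:

http://127.0.0.1:9090/alerts

8. Simulate alerts with a stress test

Install a Linux stress tool to simulate alert conditions

# Install the stress tool
yum install -y epel-release && yum install stress -y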

[root@localhost ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.7G        376M        2.9G         11M        439M        3.1G
Swap:          3.9G          0B        3.9G

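# Simulation: memory stress
# Total memory is 4 GB and the alert threshold is 75% memory usage: 4096 * 75% = 3072M
# Run 6 workers that each allocate 500M and keep it allocated, either indefinitely or for 80 seconds
# The indefinite run has to be stopped manually with Ctrl+C
stress --vm 6 --vm-bytes 500M --vm-keep
# Recover
# Press Ctrl+C to stop (Ctrl+Z only suspends the job and leaves the memory allocated)
#stress --vm 6 --vm-bytes 500M --timeout 80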
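
The sizing of the memory test can be checked with a few lines of arithmetic. This sketch uses the figures from the text (4096 MB total, 75% threshold, 6 workers of 500 MB) and takes the 376M "used" value from the free -h output as the baseline; treating that value as the rule's baseline is an approximation.

```python
TOTAL_MB = 4096                    # the text treats the machine as having 4 GB
THRESHOLD_PERCENT = 75             # threshold from the memory alert rule
BASELINE_USED_MB = 376             # "used" column of the free -h output
WORKERS, MB_PER_WORKER = 6, 500    # stress --vm 6 --vm-bytes 500M

threshold_mb = TOTAL_MB * THRESHOLD_PERCENT / 100
stress_mb = WORKERS * MB_PER_WORKER
projected_percent = (BASELINE_USED_MB + stress_mb) / TOTAL_MB * 100

print(threshold_mb)                            # 3072.0
print(stress_mb)                               # 3000
print(round(projected_percent, 1))             # 82.4
print(projected_percent > THRESHOLD_PERCENT)   # True
```

The 3000 MB allocated by stress is below the 3072 MB threshold on its own; it is the existing usage on top of it that pushes the machine past 75%.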

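# Simulation: CPU stress
# Load 6 CPU workers for 100 seconds (--timeout can be shortened to -t)
stress --cpu 6 --timeout 100

# Simulation: disk usage above 80%
# The dd command below writes a single 80 MB file; raise bs or count until df shows the filesystem above 80%
df -h
dd if=/dev/zero of=./test.io count=1 bs=80M
df -h
# Recover
rm -rf test.io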

Posted 2023-07-13 17:02 by 阿锋888
