Docker-Based GPA + Exporter Monitoring and Alerting System (Grafana, Prometheus, Alertmanager, Blackbox Exporter)
1. Set up the grafana, prometheus, blackbox_exporter and alertmanager containers
# docker run -d -p 9090:9090 -v /tmp:/etc/prometheus --name prometheus prom/prometheus
# docker run -d -p 3000:3000 --name grafana grafana/grafana:5.1.0
# docker run -d -p 9115:9115 --name blackbox_exporter -v /tmp:/config prom/blackbox-exporter --config.file=/config/blackbox.yml
# docker run -d -p 9093:9093 -v /tmp/alertmanager_config.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager

2. Configure Grafana: add Prometheus as a data source and build dashboards for visualization. Install the pie chart plugin and restart Grafana.
# docker exec -it grafana grafana-cli plugins install grafana-piechart-panel
# docker restart grafana

3. After changing their configuration files, reload blackbox_exporter and prometheus with:
# curl -XPOST http://127.0.0.1:9115/-/reload
# curl -XPOST http://127.0.0.1:9090/-/reload    // requires Prometheus to be started with --web.enable-lifecycle

4. Debugging Blackbox exporter failures
Add &debug=true to the probe URL, for example http://localhost:9115/probe?module=http_2xx&target=https://www.prometheus.io/&debug=true. This will not produce metrics, but detailed debug information instead.
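Before wiring the probes into Prometheus, the exporter can also be exercised by hand. A minimal sketch, assuming the container address used for Grafana in the target files further below (172.17.0.4:3000); /probe and /metrics are both standard Blackbox Exporter endpoints:
# curl 'http://127.0.0.1:9115/probe?module=http_2xx&target=172.17.0.4:3000'               // run one probe and print the resulting probe_* metrics
# curl 'http://127.0.0.1:9115/probe?module=http_2xx&target=172.17.0.4:3000&debug=true'    // same probe, detailed debug output instead of metrics
# curl 'http://127.0.0.1:9115/metrics'                                                    // the exporter's own metrics (probe results live on /probe)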
5. Configuration file examples:
==========================================
1.#cat blackbox.yml
modules:
  http_2xx:          # HTTP probe module
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: []    # Defaults to 2xx
      method: GET
  http_post_2xx:     # HTTP POST probe module
    prober: http
    http:
      method: POST
  tcp_connect:       # TCP probe module
    prober: tcp
    timeout: 5s
    tcp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
  ping:              # ICMP probe module
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
  dns:               # DNS probe module
    prober: dns
    timeout: 5s
    dns:
      transport_protocol: "tcp"       # defaults to "udp"
      preferred_ip_protocol: "ip4"    # defaults to "ip6"
      query_name: "xxx.xxx.cn"
      query_type: "A"
==========================================
2.#cat prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 10s
  external_labels:
    monitor: 'monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 172.17.0.5:9093

rule_files:
  - "first_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'blackbox_ping_all'
    scrape_interval: 5s
    scrape_timeout: 2s
    metrics_path: /probe
    params:
      module: [ping]
    file_sd_configs:
      - files:
          - /etc/prometheus/blackbox_ping.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.2:9115

  - job_name: 'blackbox_http_code'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - /etc/prometheus/blackbox_http_code.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.2:9115

  - job_name: "blackbox_tcp"
    metrics_path: /probe          # /probe, not /metrics
    params:
      module: [tcp_connect]       # TCP module
    file_sd_configs:
      - files:
          - /etc/prometheus/blackbox_tcp.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.2:9115

  - job_name: "blackbox_dns"
    metrics_path: /probe          # /probe, not /metrics
    params:
      module: [dns]               # DNS module
    static_configs:
      - targets:
          - xxx.xxx.xxx.xxx:53    # IP address of the DNS server
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.2:9115

1)# blackbox_ping.json
[
  {
    "targets": ["172.17.0.2", "172.17.0.3", "172.17.0.4", "172.17.0.5"]
  }
]
2)# blackbox_http_code.json
[
  {
    "targets": ["172.17.0.4:3000", "172.17.0.5:9093", "https://www.baidu.com/"]
  }
]
3)# blackbox_tcp.json
[
  {
    "targets": ["xxx.xxx.xxx.xxx:53"]
  }
]
4)# blackbox_dns.json (do not omit the port number)
[
  {
    "targets": ["xxx.xxx.xxx.xxx:53"]
  }
]
==========================================
3.#cat first_rules.yml
groups:
- name: http_code
  rules:
  - alert: HTTP_CODE_ERROR
    expr: probe_http_status_code{instance="https://xxx.xxx.com/",job="blackbox_http_code"} != 200
    for: 10m
    labels:
      severity: page
    annotations:
      summary: HTTP status code is not 200

# http_code alert rules
- name: example
  rules:
  - alert: ProbeFailing
    expr: up{job="blackbox_http_code"} == 0 or probe_success{job="blackbox_http_code"} == 0
    for: 10m
    labels:
      severity: page

# ssl expiry alert rules
- name: ssl_expiry.rules
  rules:
  - alert: SSLCertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry{job="blackbox_http_code"} - time() < 86400 * 30
    for: 10m

# ping rules
- name: ping_all
  rules:
  - alert: Ping_down
    expr: probe_success{job="blackbox_ping_all"} == 0
    for: 10m
==========================================
4.#cat alertmanager.yml
# global settings
global:
  resolve_timeout: 5m                       # how long to wait before marking a no-longer-reported alert as resolved; defaults to 5m
  smtp_smarthost: 'smtp.xxx.xxx.com:25'     # SMTP server
  smtp_from: 'xxx@xx.com'                   # sender address
  smtp_auth_username: 'xxx@xx.com'          # SMTP account
  smtp_auth_password: 'xxxxxx'              # SMTP password or authorization code

# notification templates
templates:
  - 'template/*.tmpl'

# routing tree
route:
  group_by: ['severity']     # label(s) used to group alerts
  group_wait: 10s            # how long to wait after a new alert group is created before sending the first notification, so that several alerts for the same group can be batched into one message
  group_interval: 10s        # how long to wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 1m        # how long to wait before re-sending a notification that was already delivered; for email receivers do not set this too low, or the SMTP server may reject the flood of messages
  receiver: 'email'          # default receiver; must match one of the names under receivers
  # All of the attributes above are inherited by child routes and can be overridden per route.
  routes:
  - receiver: email
    group_wait: 10s
    match:
      instance: 172.17.0.2

# receivers
receivers:
- name: 'email'
  email_configs:                              # email settings
  - to: 'xxx@xx.com'                          # recipient address
    html: '{{ template "test.html" . }}'      # body template for the email
    headers: { Subject: "[WARN] Alert notification" }    # email subject

# An inhibition rule mutes alerts matching one set of matchers while an alert matching another
# set of matchers is firing. Both alerts must have identical values for the listed labels.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
==========================================
5.#cat test.tmpl
{{ define "test.html" }}
<table border="1">
  <tr>
    <td>Alert</td>
    <td>Instance</td>
    <td>Value</td>
    <td>Started at</td>
  </tr>
  {{ range $i, $alert := .Alerts }}
  <tr>
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "value" }}</td>
    <td>{{ $alert.StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}
==========================================
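Before reloading (step 3), the files above can be syntax-checked first. A minimal sketch, assuming the container names and mount paths from step 1 and that promtool and amtool are shipped in the official prom/prometheus and prom/alertmanager images:
# docker exec prometheus promtool check config /etc/prometheus/prometheus.yml        // validates prometheus.yml plus the rule files it references
# docker exec prometheus promtool check rules /etc/prometheus/first_rules.yml        // validates the alerting rules on their own
# docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml    // validates alertmanager.yml and its templates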
# Summary of how it works and issues encountered:
1. blackbox_exporter is the Prometheus component that probes HTTP, DNS, TCP and ICMP endpoints from an agent machine: Prometheus passes the module and target as URL parameters, and the exporter maps them onto its web interface to run the actual probe. By default the Blackbox Exporter listens on port 9115 and serves probe results on the /probe endpoint. Its scrape_configs differ from those of other exporters: most notably, the targets directive lists the endpoints being probed rather than the exporter's address; the exporter's own address is set via the __address__ label in relabel_configs.
2. After Prometheus has collected the data, alerting turned out to be delayed: Alertmanager only acts once Prometheus pushes an alert to it. Observed behaviour: when a target node goes down, the exporter detects it quickly and the metric in Prometheus is updated, but the alert in the Prometheus UI does not change immediately; only after a while does it go from green to yellow and enter the pending state. This is governed by evaluation_interval (default 1m), the interval at which alerting rules are evaluated and alert state is refreshed. Then, according to the for setting in the rule (5s in this test), the alert turns from yellow to red and enters the firing state; at that point it is sent to Alertmanager, which sends the email notification. In this test the same alert was re-emailed roughly every 30s; a Silence can be set to temporarily suppress notifications (see the amtool sketch at the end of this post). The repeat_interval, group_wait and group_interval parameters (configured under the route tree) control when and how often notifications go out and prevent alert storms, so it pays to understand them. Note: if group_wait is set too long, notifications are delayed; if it is too short, you get a flood of email. Tune it according to the severity of each alert.
3. Alertmanager's routing and grouping is driven by group_by together with the labels defined in the alert rule files. Grouped alerts are merged into a single aggregated email, which reduces the number of notification emails.
4. Other exporters can be set up in the same way: install the exporter, then import its targets into Prometheus through target discovery (a file_sd sketch follows below).
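As a hypothetical illustration of point 4 (the exporter, file name and address below are examples, not part of this setup), adding a newly installed exporter only needs a target file and a plain scrape job, since it is scraped directly on /metrics and needs no /probe relabeling:
# docker run -d -p 9100:9100 --name node_exporter prom/node-exporter    // example exporter; any other works the same way
#cat /etc/prometheus/node.json                                          // hypothetical target file
[
  {
    "targets": ["172.17.0.6:9100"]
  }
]
#cat prometheus.yml (additional scrape job)
  - job_name: 'node'
    file_sd_configs:
      - files:
          - /etc/prometheus/node.json
        refresh_interval: 1m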
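And the amtool sketch referred to in point 2: a minimal way to inspect alert state and set a temporary Silence, assuming the container names from step 1 and that amtool is shipped in the prom/alertmanager image (the alertname and instance matchers are taken from the rules and ping targets above):
# curl -s http://127.0.0.1:9090/api/v1/alerts                                                      // pending/firing alerts on the Prometheus side
# docker exec alertmanager amtool alert query --alertmanager.url=http://127.0.0.1:9093             // alerts currently held by Alertmanager
# docker exec alertmanager amtool silence add alertname=Ping_down instance=172.17.0.3 --comment="planned maintenance" --duration=2h --alertmanager.url=http://127.0.0.1:9093    // mute this alert for two hours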