Alertmanager alerting

Alertmanager:

The alerting component of Prometheus

github: https://github.com/prometheus/alertmanager/tags
Ports:

  • 9093: communication port (HTTP API and web UI)
  • 9094: cluster port

Alerting flow:

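Prometheus scrapes data --> alert rule fires --> Alertmanager --> grouping, inhibition, silencing --> receiver type --> WeChat, etc.

Grouping: merges alerts of a similar nature into a single notification
Silencing: mutes matching alerts for a given period of time, similar to a maintenance period in Zabbix
Inhibition: suppresses notifications for certain alerts while other, related alerts are already firing
Routing: configures how incoming alerts are dispatched; the result of the route matching rules determines the path and handling of each notification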
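
As a rough illustration of grouping, the sketch below buckets alerts by the values of chosen labels, which is the idea behind the group_by option. The function name group_alerts and the sample alerts are made up for this example; it is a simplification, not how Alertmanager itself is implemented.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # An alert that lacks a label is grouped under an empty value for it.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical sample alerts, represented as plain label dictionaries.
alerts = [
    {"alertname": "InstanceDown", "instance": "node1:9100"},
    {"alertname": "InstanceDown", "instance": "node2:9100"},
    {"alertname": "HighCPU", "instance": "node1:9100"},
]

groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    print(key, len(members))
```

With group_by set to alertname, the two InstanceDown alerts end up in one group and would be delivered as a single notification.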

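Configuration steps:

1. Tell Prometheus where Alertmanager is located
Both static configuration and dynamic service discovery are supported
Alertmanager should also be added as a scrape target so that its own running state is monitored
2. Define alerting rules in Prometheus (which conditions count as abnormal)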


Configuration file:

Official docs: https://prometheus.io/docs/alerting/latest/configuration/

Main configuration file:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '1137127273@qq.com'
  smtp_auth_username: '1137127273@qq.com'
  smtp_auth_password: 'your_smtp_auth_code'
  smtp_hello: '@qq.com'
  smtp_require_tls: false
  smtp_auth_identity: <string>
  smtp_auth_secret: <secret>

  slack_api_url: <secret> 
  slack_api_url_file: <filepath> 

  victorops_api_key: <secret> 
  victorops_api_key_file: <filepath> 
  victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" 

  pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" 

  opsgenie_api_key: <secret> 
  opsgenie_api_key_file: <filepath> 
  opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" 

  #WeCom (enterprise WeChat)
  wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" 
  wechat_api_secret: <secret> 
  wechat_api_corp_id: <string> 

  telegram_api_url: <string> | default = "https://api.telegram.org" 
  webex_api_url: <string> | default = "https://webexapis.com/v1/messages" 
  http_config: <http_config>


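route:                    #alert dispatch policy
  group_by: ['alertname']   #labels used to group alerts
  group_wait: 30s           #how long to wait before sending the first notification for a new group (30s here)
  group_interval: 5m        #how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1h       #how long to wait before re-sending a notification, default 4h
  receiver: 'web.hook'      #receiver to notify
  continue: false         #default false: once an alert matches this route, later sibling routes are not evaluated
  match:            #an alert is routed here only if its labels equal all of these values
    label1: value
  match_re:         #an alert is routed here only if its labels match all of these regular expressions
    label1: regex
  matchers:         #matcher expressions written as strings
  - label1=xxx       #supported operators: =, !=, =~, !~
  mute_time_intervals: []         #time intervals during which this route is muted; each name must match an entry defined under time_intervals
  active_time_intervals: []       #time intervals during which this route is active; each name must match an entry defined under time_intervals; empty means the route is always active
  routes:          #zero or more child routes; an alert goes to the first matching child route and is handled by the current route only if no child matches
  - receiver: 'database-pager'        #send to the receiver named database-pager
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"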

receivers:        #receivers, i.e. the destinations that get the alert notifications
- name: 'web.hook'
  #webhook_configs:	#sends through an HTTP API call; commented out because only email is available here
  #  - url: 'http://127.0.0.1:5001/'
  email_configs:
  - to: '1137127273@qq.com'
    from: email-address          #sender address
    smarthost: SMTP     #address of the SMTP server
    auth_username: account       #the remaining options mirror the global smtp_* settings
    ...
    
- name: 'dingding.<dingtalk-webhook-name>'		#DingTalk alerting
  webhook_configs:
  - url: 'http://172.29.1.11:8060/dingtalk/<dingtalk-webhook-name>/send'
    send_resolved: true		#also send a notification when the alert resolves
    
- name: 'team-X-slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T0BRR5J4C/B0BRR5J4C/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
- name: 'wechat'
  wechat_configs:
  - api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'   #WeCom API URL
    corp_id: 'your_corp_id'           #WeCom corporation ID
    to_party: 'PartyID1|PartyID2'     #department IDs that receive the notification, separated by |
    agent_id: 'your_agent_id'         #WeCom application (agent) ID
    api_secret: 'your_api_secret'     #WeCom application secret

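inhibit_rules:		      #inhibition rules: mute target alerts while a matching source alert is firing. A rule applies only when the source alert satisfies every source_match condition and the target alert satisfies every target_match condition
- source_match:	      #the alert that is already firing
    severity: <severity level>
      #critical: severe problem that needs immediate attention
      #warning
      #info
      #high
      #medium
      #low
  target_match:	      #the alert to be muted
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']	#labels that must have the same value in the source and target alerts for the rule to apply
  source_match_re:        #regular expression variants
  target_match_re: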

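#mute_time_intervals:      #at the top level this is the deprecated name of time_intervals; prefer time_intervals below
#Inside a route, mute_time_intervals lists the names of intervals defined under time_intervals, for example ["xxx"]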

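time_intervals:     #named time intervals that routes reference through mute_time_intervals and active_time_intervals, e.g. to act only during part of the day
- name: name
  time_intervals:
  - weekdays: []          #days of the week by name; ranges such as 'monday:friday' are allowed
      #sunday
      #monday
      #tuesday
      #wednesday
      #thursday
      #friday
      #saturday
    years: []             #year ranges
    months: []            #months, 1 to 12 or by name
    days_of_month: []     #days of the month, 1 to 31
- name: "xxx"
  time_intervals:
  - times:                #every night from 18:00 to 06:00; end_time must be later than start_time, so the overnight window is split in two
    - start_time: "18:00"
      end_time: "24:00"
    - start_time: "00:00"
      end_time: "06:00"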

templates:      #template files for notifications, written in Go's text/template language
- tmp.tmpl
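
The inhibit_rules section of the configuration above can be read as a small matching procedure. The following is a simplified sketch of that reading, not Alertmanager's code: the helper is_inhibited, the dictionary layout of the rule, and the sample alerts are all assumptions made for the example, and only exact-value matching (source_match, target_match, equal) is modelled.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if some firing alert inhibits `target` under `rule`."""

    def matches(alert, wanted):
        # Every wanted label must be present with exactly the given value.
        return all(alert.get(k) == v for k, v in wanted.items())

    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # in this sketch an alert never inhibits itself
        if not matches(source, rule["source_match"]):
            continue
        # Every label listed in "equal" must carry the same value on both alerts.
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False


# Hypothetical rule: a critical alert mutes warning alerts for the same instance.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
critical = {"alertname": "InstanceDown", "instance": "node1", "severity": "critical"}
warning_same = {"alertname": "InstanceDown", "instance": "node1", "severity": "warning"}
warning_other = {"alertname": "InstanceDown", "instance": "node2", "severity": "warning"}
firing = [critical, warning_same, warning_other]

print(is_inhibited(warning_same, firing, rule))
print(is_inhibited(warning_other, firing, rule))
```

The warning on node1 is muted because a critical alert with the same alertname and instance is firing; the warning on node2 is still delivered because the equal labels differ.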

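Alert notification template:

#This template renders: email subject, alert name, metric, labels, severity, start time, and the affected host. For a Kubernetes pod alert it shows the pod name, pod IP, and node IP instead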
cat > tmp.tmpl <<eof
{{ define "email.subject" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} for {{ range .Alerts }}{{ .Labels.instance }} {{ end }}
{{ end }}

{{ define "email.text" }}
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
Metric: {{ .Annotations.metric }}
Labels: {{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}, {{ end }}
Severity: {{ .Labels.severity }}
Time: {{ .StartsAt }}
Host: {{ if eq .Labels.alertname "PodAlert" }} 
  Pod Name: {{ .Labels.pod }},
  Pod IP: {{ .Labels.pod_ip }},
  Node IP: {{ .Labels.node_ip }} 
{{ else }} 
  {{ .Labels.instance }}
{{ end }}
URL: {{ .GeneratorURL }}
{{ end }}
{{ end }}
eof

Prometheus configuration:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 2.2.2.43:9093

rule_files:
- hj.yml

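Common alert expressions:

sum by (name) (rate(container_cpu_usage_seconds_total{image!=""}[5m])) * 100 > 1	#pod CPU usage above 1%
sort_desc(avg by (name) (irate(container_memory_usage_bytes{name!=""}[5m])) * 100) > 10		#memory alert with threshold 10; note that container_memory_usage_bytes is a gauge, so irate() yields its rate of change rather than a usage percentage
sum by (name) (irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 50*1024*1024		#pod inbound network traffic above 50 MiB/s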

Installation:

1) Download

wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xf alertmanager-0.25.0.linux-amd64.tar.gz 
cd alertmanager-0.25.0.linux-amd64/
mv alertmanager amtool /bin/
ln -s `pwd` /opt/alertmanager

2) Write the systemd service

cat > /etc/systemd/system/alertmanager.service <<-eof
[Unit]
After=network.target

[Service]
Environment="CONF_FILE=/opt/alertmanager/alertmanager.yml"
Environment="IP=`hostname -i`"
Type=simple
User=prometheus
ExecStartPre=/bin/amtool check-config \$CONF_FILE
ExecStart=/bin/alertmanager \
  --config.file=\${CONF_FILE} \
  --web.listen-address=":9093" \
  --web.external-url=http://\${IP} \
  --storage.path=/data/alertmanager \
  --cluster.advertise-address=\${IP}:9094 \
  --cluster.listen-address=":9094"	\
  --cluster.allow-insecure-public-advertise-address-discovery
ExecReload=/bin/kill -HUP \$MAINPID
TimeoutStopSec=10s
Restart=on-failure

[Install]
WantedBy=multi-user.target
eof

systemctl daemon-reload 
systemctl start alertmanager.service

Examples:

Example 1: email alerts

1) Configure the main Alertmanager configuration file

cat > alertmanager.yml <<eof
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10s
  receiver: 'email-me'

receivers:
- name: 'email-me'
  email_configs:
  - to: '1137127273@qq.com'		#recipient
    from: '1137127273@qq.com'	#sender
    smarthost: 'smtp.qq.com:465'
    auth_username: '1137127273@qq.com'
    auth_password: 'your_smtp_auth_code'
    require_tls: false
inhibit_rules:
- source_match:
    alertname: InstanceDown
    severity: critical
  target_match:
    alertname: InstanceDown
    severity: critical
  equal:
    - instance
eof

2) Configure the main Prometheus configuration file

mkdir {targets,rules,alert_rules}

cat > prometheus.yml <<eof
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - targets/alertmanagers.yml

rule_files:
- "rules/*.yaml"
- "alert_rules/*.yaml"

scrape_configs:
- job_name: "prometheus"
  static_configs:
    - targets: ["localhost:9090"]
- job_name: 'alertmanagers'
  file_sd_configs:
  - files:
    - targets/alertmanagers.yml
    refresh_interval: 2m
- job_name: blackbox_all
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
  - targets:
    - 'http://www.qq.com'
    labels: 
      group: web
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: url
  - target_label: __address__
    replacement: 127.0.0.1:9115
  - source_labels: [__meta_dns_name]
    target_label: __param_hostname
  - source_labels: [__meta_dns_name]
    target_label: vhost
eof

3) Create the Alertmanager service discovery file for Prometheus

cat > targets/alertmanagers.yml <<eof
- targets:
  - 2.2.2.15:9093
  labels:
    app: alertmanagers
eof

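4) Define the alerting rule

#The heredoc delimiter is quoted so that the shell does not expand $labels and $value
cat > alert_rules/instance_down.yaml <<'eof'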
groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    annotations:
      title: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.name }} {{ $labels.instance }} of job {{ $labels.job }} has been down,current value: {{ $value }}"
    labels:
      severity: 'critical'
eof

5) Start Prometheus and Alertmanager
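
To check the email pipeline without waiting for a real outage, one option is to push a hand-made alert to Alertmanager's v2 HTTP API (POST /api/v2/alerts). The sketch below assumes Alertmanager listens on the address used in step 3 (2.2.2.15:9093); the helper names build_alert and send_alert and the instance label value are illustrative, not part of the original setup.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(alertname, instance, severity="critical"):
    """Build one alert object in the shape accepted by POST /api/v2/alerts."""
    return {
        "labels": {"alertname": alertname, "instance": instance, "severity": severity},
        "annotations": {"title": f"Instance {instance} down"},
        # RFC 3339 timestamp marking when the alert started firing.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }


def send_alert(base_url, alert):
    """POST a single alert; the endpoint expects a JSON array of alerts."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps([alert]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical instance label; adjust to a target that exists in your setup.
alert = build_alert("InstanceDown", "2.2.2.15:9100")
print(json.dumps([alert], indent=2))

# Uncomment once Alertmanager is running and reachable:
# send_alert("http://2.2.2.15:9093", alert)
```

Because the labels match alertname InstanceDown, the test alert should follow the same route as a real one and arrive at the email-me receiver after group_wait has elapsed.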

Example 2: DingTalk alerts

Plugin download: https://github.com/timonwong/prometheus-webhook-dingtalk
Port: 8060


prometheus-webhook-dingtalk configuration template:

timeout: 5s       #timeout for sending a message

no_builtin_template: true		#when true, the built-in message templates are not used; user-defined templates listed under templates are used instead

templates:      #message templates
- contrib/templates/legacy/template.tmpl

default_message:
  title: '{{ template "legacy.title" . }}'
  text: '{{ template "legacy.content" . }}'

targets:
  <custom-name>:		
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    message:		#custom message
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:		#mention mode
      all: true		#mention everyone in the group
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      mobiles: ['156xxxx8827', '189xxxx8325']		#mention the users with these mobile numbers

1) Create a DingTalk group and add a robot

2) Enable signing for the robot

url == the robot's webhook URL
secret == the robot's signing secret
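
For background on what the signing secret is used for: DingTalk's robot signing scheme appends a millisecond timestamp and an HMAC-SHA256 signature to the webhook URL. As far as I understand, prometheus-webhook-dingtalk does this itself when secret is set, so the sketch below is only meant for understanding the scheme or for testing a robot by hand. The function name signed_webhook_url and the fixed timestamp are assumptions for the example; the URL and secret are the placeholders used elsewhere in this post.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def signed_webhook_url(url, secret, timestamp_ms=None):
    """Append the timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # Base64-encode the digest, then URL-encode it for use in the query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{url}&timestamp={timestamp_ms}&sign={sign}"


url = "https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx"
signed = signed_webhook_url(url, "SEC000000000000000000000", timestamp_ms=1700000000000)
print(signed)
```

In real use the timestamp should be the current time, since a signature built from an old timestamp is expected to be rejected.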

3) Run prometheus-webhook-dingtalk

Configuration file

#Replace url and secret with your own values
cat > config.yml <<eof
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      all: true
eof
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
tar xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
cd prometheus-webhook-dingtalk-2.1.0.linux-amd64/
mv prometheus-webhook-dingtalk /bin/

prometheus-webhook-dingtalk --web.enable-ui --web.enable-lifecycle

4) Configure Alertmanager

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity', 'namespace']
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'dingding.webhook1'
  routes:
  - receiver: 'dingding.webhook1'
    match:
      team: DevOps
    group_wait: 10s
    group_interval: 15s
    repeat_interval: 3h
  - receiver: 'dingding.webhook.all'
    match:
      team: SRE
    group_wait: 10s
    group_interval: 15s
    repeat_interval: 3h

receivers:
- name: 'dingding.webhook1'
  webhook_configs:
  - url: 'http://172.29.1.11:8060/dingtalk/webhook1/send'
    send_resolved: true
- name: 'dingding.webhook.all'
  webhook_configs:
  - url: 'http://172.29.1.11:8060/dingtalk/webhook_mention_all/send'
    send_resolved: true
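
The routing tree in this configuration can be traced by hand with a small model. The sketch below is a simplification under stated assumptions: pick_receivers is a made-up helper, only exact match conditions are supported, and every child route is expected to name its own receiver (real child routes can inherit settings from the parent).

```python
def pick_receivers(route, labels):
    """Walk the routing tree and return the receivers that would be notified."""
    receivers = []
    for child in route.get("routes", []):
        conditions = child.get("match", {})
        if all(labels.get(k) == v for k, v in conditions.items()):
            receivers.extend(pick_receivers(child, labels))
            if not child.get("continue", False):
                break  # stop at the first matching child unless continue is true
    # If no child route matched, the alert is handled by the current route.
    return receivers or [route["receiver"]]


# The route section from step 4, reduced to receivers and match conditions.
root = {
    "receiver": "dingding.webhook1",
    "routes": [
        {"receiver": "dingding.webhook1", "match": {"team": "DevOps"}},
        {"receiver": "dingding.webhook.all", "match": {"team": "SRE"}},
    ],
}

print(pick_receivers(root, {"alertname": "InstanceDown", "team": "SRE"}))
print(pick_receivers(root, {"alertname": "InstanceDown"}))
```

An alert labelled team: SRE goes to dingding.webhook.all, while an alert without a team label falls back to the top-level receiver dingding.webhook1.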

Cluster mode:

Cluster members exchange state over a gossip protocol, so that a given alert is notified by only one Alertmanager instance

alertmanager --cluster.listen-address=":9094" --cluster.peer=2:8001
posted @ 2023-11-06 18:07  suyanhj