Deploying Prometheus, Grafana Monitoring, and DingTalk Alerting with docker-compose (Part 4)
4. Prometheus DingTalk Alerting
DingTalk alerting for Prometheus belongs to the Alertmanager part of the setup.
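- Create a DingTalk group and add a robot
  - Create a test group: start a new group with two other people, then remove them, which leaves a private group for testing.
  - Open Group Settings, then Robots.
  - Add a custom robot.
  - Enable the signature ("加签") security option. The generated secret, which starts with SEC, goes into the `secret` field of config.yml.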
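The signature option means that every request to the robot URL has to carry a `timestamp` and a `sign` query parameter. In this setup the `secret` field in config.yml is what supplies the signing key, so none of this has to be written by hand; the sketch below only illustrates the scheme as DingTalk documents it for custom robots (HMAC-SHA256 over `timestamp + "\n" + secret`, then Base64 and URL-encoding). The function names `dingtalk_sign` and `signed_webhook_url` are made up for this example.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret, timestamp_ms):
    """Return the URL-encoded signature for a millisecond timestamp."""
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # Base64-encode the raw digest, then URL-encode it for use in a query string.
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_webhook_url(webhook_url, secret, timestamp_ms=None):
    """Append timestamp and sign to a robot URL that already has access_token."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

Calling `signed_webhook_url("https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx", "SEC0xxxxxxxx")` yields a URL that can be used for a one-off manual test of the robot; the token and secret here are the same placeholders used in config.yml.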
- Create the working directory
[root@128-255-96 prometheus]# pwd
/home/prometheus/docker/prometheus
mkdir -p ./alertmanager/dingtalk && cd ./alertmanager/dingtalk && vim config.yml
- Write the config.yml configuration file
## Request timeout
# timeout: 5s
## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true
## Customizable templates path
templates:
  # - contrib/templates/legacy/template.tmpl
  - /root/contrib/templates/*.tmpl
## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'
## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    # secret for signature
    secret: SEC0xxxxxxxx
    message:
      title: '{{ template "_ding.link.title" . }}'
      text: '{{ template "_ding.link.content" . }}'
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    message:
      title: '{{ template "ding.link.title" . }}'
      text: '{{ template "ding.link.content" . }}'
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    mention:
      all: true # @ everyone in the group
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
    secret: SEC0xxxxxxxx
    mention:
      mobiles: ['186****7521'] # @ specific users by mobile number
- Write the dingtalk.tmpl file
mkdir -p contrib/templates/ && cd contrib/templates
vim dingtalk.tmpl
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}
{{ define "___text_alert_list_with_help" }}
{{ template "___text_alert_list" .Alerts.Firing }}
---
**Help:** {{ (index .Alerts.Firing 0).Annotations.description | markdown | html }}
{{ end }}
{{ define "___text_alert_list" }}{{ range . }}
---
**Alert:** {{ .Labels.alertname | upper }}
**Severity:** {{ .Labels.severity | upper }}
**Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**Summary:** {{ .Annotations.summary | markdown | html }}
**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ define "___text_alertresolve_list" }}{{ range . }}
---
**Alert:** {{ .Labels.alertname | upper }}
**Severity:** {{ .Labels.severity | upper }}
**Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
**Summary:** {{ .Annotations.summary | markdown | html }}
**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{/* Default */}}
{{ define "_default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "_default.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}

**========Alerts Firing========**
{{/* template "___text_alert_list" .Alerts.Firing */}}
{{ template "___text_alert_list_with_help" . }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}

**========Alerts Resolved========**
{{ template "___text_alertresolve_list" .Alerts.Resolved }}
{{- end }}
{{- end }}
{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{/* Following names for compatibility */}}
{{ define "_ding.link.title" }}{{ template "_default.title" . }}{{ end }}
{{ define "_ding.link.content" }}{{ template "_default.content" . }}{{ end }}
- Write the docker-compose-dingtalk.yml file
cd ../../../../
vim docker-compose-dingtalk.yml
version: '3'
services:
  dingtalk-alert:
    image: timonwong/prometheus-webhook-dingtalk
    container_name: dingtalk-alert
    restart: always
    ports:
      - "8060:8060"
    volumes:
      - /home/prometheus/docker/prometheus/alertmanager/dingtalk:/root
    command: --config.file=/root/config.yml
    networks:
      - prometheus
networks:
  prometheus:
    name: prometheus
- Start the container
docker-compose -f docker-compose-dingtalk.yml up -d
- Verify the deployment
Use Postman to send a test DingTalk message and confirm that it arrives in the group.
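The same check can be scripted instead of using Postman. The sketch below builds a minimal payload in the shape Alertmanager uses for webhook receivers and posts it to the container. It assumes the `/dingtalk/<target>/send` path that prometheus-webhook-dingtalk serves for each configured target, with `webhook1` and port 8060 taken from the files above; the alert name, the annotation text, and the two `localhost` URLs inside the payload are placeholders invented for the test.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(alertname="TestAlert", severity="warning"):
    """Build a minimal Alertmanager-style webhook payload with one firing alert."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {"alertname": alertname, "severity": severity}
    # The template reads both summary and description, so provide both.
    annotations = {
        "summary": "Test message from the deployment check",
        "description": "Safe to ignore.",
    }
    alert = {
        "status": "firing",
        "labels": labels,
        "annotations": annotations,
        "startsAt": now,
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://localhost:9090/graph",  # placeholder
    }
    return {
        "version": "4",
        "groupKey": '{}:{alertname="%s"}' % alertname,
        "status": "firing",
        "receiver": "dingtalk",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://localhost:9093",  # placeholder
        "alerts": [alert],
    }


def send_test_alert(url, payload, timeout=5):
    """POST the payload as JSON and return (HTTP status, response body)."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

On the Docker host, `send_test_alert("http://localhost:8060/dingtalk/webhook1/send", build_test_payload())` should make a test message appear in the group if the container, token, and secret are all set up correctly.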
- Add alerting rules
vim prometheus/rules/mssql_rules.yml
- Edit the mssql_rules.yml file
groups:
  - name: MSSQL alert rules
    rules:
      - alert: 'Pending queue too long'
        expr: mssql_current_exec_num > 10
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "{{$labels.exported_instance}}: Too many scripts executing.(Current value is: {{$value}})"
          description: "Run the following statement for details:
            SELECT
            der.[session_id],der.[blocking_session_id],
            sp.lastwaittype,sp.hostname,sp.program_name,sp.loginame,
            der.[start_time] AS 'Start time',
            der.[status] AS 'Status',
            dest.[text] AS 'SQL text',
            DB_NAME(der.[database_id]) AS 'Database',
            der.[wait_type] AS 'Wait type',
            der.[wait_time] AS 'Wait time',
            der.[wait_resource] AS 'Wait resource',
            der.[logical_reads] AS 'Logical reads'
            FROM sys.[dm_exec_requests] AS der
            INNER JOIN master.dbo.sysprocesses AS sp ON der.session_id=sp.spid
            CROSS APPLY sys.[dm_exec_sql_text](der.[sql_handle]) AS dest
            WHERE [session_id]>50 AND session_id<>@@SPID
            ORDER BY der.[session_id];"
      - alert: 'Database state abnormal'
        expr: mssql_database_state != 0
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "{{$labels.exported_instance}}: {{$labels.database}} is not online.(Current state is {{$value}})."
          description: "0=ONLINE 1=RESTORING 2=RECOVERING 3=RECOVERY_PENDING 4=SUSPECT 5=EMERGENCY 6=OFFLINE 7=COPYING 10=OFFLINE_SECONDARY \r\nRun the following statement for details:
            SELECT [name] AS [database],[state] FROM sys.databases;"
      - alert: 'Database deadlock'
        expr: mssql_current_deadlocks != 0
        for: 3m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "{{$labels.exported_instance}}: deadlocks occurs.(Current count is {{$value}})"
          description: "Run the following statement for details:
            SELECT
            request_session_id spid,
            DB_NAME(resource_database_id) [DataBase],
            OBJECT_NAME(resource_associated_entity_id) TableName
            FROM sys.dm_tran_locks
            WHERE resource_type='OBJECT';
            DBCC INPUTBUFFER(spid);"
      - alert: 'Script execution taking too long'
        expr: mssql_long_elapsed_count != 0
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "{{$labels.exported_instance}}: Sql scripts execute for long time.(Current count is {{$value}})"
          description: "Run the following statement for details:
            SELECT
            (total_elapsed_time / execution_count)/1000 N'Avg time (ms)'
            ,total_elapsed_time/1000 N'Total elapsed time (ms)'
            ,total_worker_time/1000 N'Total CPU time (ms)'
            ,total_physical_reads N'Total physical reads'
            ,total_logical_reads/execution_count N'Logical reads per execution'
            ,total_logical_reads N'Total logical reads'
            ,total_logical_writes N'Total logical writes'
            ,execution_count N'Execution count'
            ,SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
            ((CASE statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset END
            - qs.statement_start_offset)/2) + 1) N'Statement'
            ,db_name(st.dbid) N'Database'
            ,creation_time N'Compiled at'
            ,last_execution_time N'Last executed at'
            FROM sys.dm_exec_query_stats AS qs
            CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
            WHERE creation_time > DATEADD(S, -15, GETDATE()) /* or: BETWEEN '2023-04-20 00:00:00' AND '2023-04-22 00:00:00' */
            AND (total_elapsed_time / execution_count)/(1000) > 1000
            ORDER BY total_elapsed_time / execution_count DESC;"
      - alert: MSSQL engine service down
        expr: windows_service_state{state="running",exported_name="mssqlserver"}!=1
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
      - alert: MSSQL agent service down
        expr: windows_service_state{exported_name="sqlserveragent",state="running"}!=1
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
      - alert: MSSQL engine service restarted
        expr: mssql_db_uptime < 3600
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "The MSSQL engine service restarted within the last hour; it has been up for {{ $value }} seconds"
      - alert: MSSQL database unavailable or inaccessible
        expr: mssql_current_state_dbState !=0
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "db:{{ $labels.db }}\n value:{{ $labels.value }}={{ $value }} "
      - alert: MSSQL blocking
        expr: sum(mssql_current_state_blocking)>5
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "MSSQL blocked requests > 5, current: {{ $value }} "
      - alert: MSSQL too many requests
        expr: sum(mssql_current_state_requests)>100
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "MSSQL requests > 100, current: {{ $value }} "
      - alert: MSSQL deadlock detected
        expr: increase(mssql_counter{type_object="SQLServer:Locks",type_counter="Number of Deadlocks/sec",type_instance="_Total"}[5m])>0
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "MSSQL deadlocks in the last 5 minutes: {{ $value }} "
      - alert: MSSQL job execution error
        expr: increase(mssql_job_state_today[5m])>0
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "MSSQL job failures today: {{ $value }} "
      - alert: MSSQL mirroring status changed
        expr: increase(mssql_mirror_sync{value="status"}[5m])!=0
        for: 1m
        labels:
          severity: warning
          notify_type: dingtalk
        annotations:
          summary: "Details: {{ $labels }}"
          description: "db:{{ $labels.db }}\n value:{{ $labels.value }}={{ $value }} "
- Restart Prometheus
docker-compose -f docker-compose-prometheus.yml restart
- Check the DingTalk message
[FIRING:2] Database state abnormal
========Alerts Firing========
Alert: DATABASE STATE ABNORMAL
Severity: WARNING
Started at: 2023.04.24 17:56:03
Summary: 128.0.23.17:1433: IISLogDB is not online.(Current state is 6).
Labels:
alertname: Database state abnormal
database: IISLogDB
exported_instance: 128.0.23.17:1433
exporter_type: prom-mssql-exporter
host: 128.0.23.17:1433
instance: 128.0.255.96:14001
job: prometheus-mssql-exporter
monitor: line-monitor
notify_type: dingtalk
Alert: DATABASE STATE ABNORMAL
Severity: WARNING
Started at: 2023.04.24 17:56:03
Summary: sqlserver_xulq: IISLogDB is not online.(Current state is 6).
Labels:
alertname: Database state abnormal
database: IISLogDB
exported_instance: sqlserver_xulq
exported_job: sql-exporter
exporter_type: sql-exporter
instance: sql-exporter:9399
job: sql-exporter
monitor: line-monitor
notify_type: dingtalk
Help: 0=ONLINE 1=RESTORING 2=RECOVERING 3=RECOVERY_PENDING 4=SUSPECT 5=EMERGENCY 6=OFFLINE 7=COPYING 10=OFFLINE_SECONDARY
Run the following statement for details: SELECT [name] AS [database],[state] FROM sys.databases;
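The message reports the database state as a bare number (6 in the sample above) and leaves the reader to look it up in the help line. As a small aid when triaging such alerts, the lookup can be sketched as follows; the mapping is copied from the rule's description text, while the helper name `describe_state` is an invention for this example.

```python
# State codes as listed in the alert rule's description annotation.
MSSQL_DATABASE_STATES = {
    0: "ONLINE",
    1: "RESTORING",
    2: "RECOVERING",
    3: "RECOVERY_PENDING",
    4: "SUSPECT",
    5: "EMERGENCY",
    6: "OFFLINE",
    7: "COPYING",
    10: "OFFLINE_SECONDARY",
}


def describe_state(value):
    """Map a metric value (number or numeric string) to its state name."""
    # Prometheus sample values are floats, so accept "6", 6, or 6.0.
    code = int(float(value))
    return MSSQL_DATABASE_STATES.get(code, f"UNKNOWN({code})")


if __name__ == "__main__":
    print(describe_state("6"))
```

For the sample alert, `describe_state("6")` resolves to `OFFLINE`, which matches the help line in the message.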