SkyWalking链路追踪系统-告警篇
来源:https://cloud.tencent.com/developer/article/1808306
1、概述
Skywalking
发送告警的基本原理是每隔一段时间轮询skywalking-oap
收集到的链路追踪的数据,再根据所配置的告警规则(如服务响应时间、服务响应时间百分比)等,如果达到阈值则发送响应的告警信息。 发送告警信息是以线程池异步的方式调用webhook
接口完成的,具体的webhook
接口可以由使用者自行定义,从而可以在指定的webhook
接口中自行编写各种告警方式,比如钉钉告警、邮件告警等等。告警的信息也可以在RocketBot
即ui
中查看到。
目前对应我前面文章中部署的8.4.0
版本支持的告警接口如下:
- 普通webhook
- gRPCHook
- Slack Chat Hook
- WeChat Hook(微信告警)
- Dingtalk Hook(钉钉告警)
- Feishu Hook(飞书告警)
2、告警规则
2.1 默认告警规则
在Skywalking
中,告警规则称为rule
,默认安装的Skywalking
oap server
组件中包含了告警规则的配置文件,位于安装目录下config
文件夹下alarm-settings.yml
文件中,在容器中运行的也是一样的
# kubectl -n monitoring exec -it skywalking-oap-57d7f454f5-w4k4j -- bash
bash-5.0# pwd
/skywalking
bash-5.0# cat config/alarm-settings.yml
以下是默认的告警规则配置文件内容
rules:
# Rule unique name, must be ended with `_rule`.
service_resp_time_rule:
metrics-name: service_resp_time
op: ">"
threshold: 1000
period: 10
count: 3
silence-period: 5
message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
service_sla_rule:
# Metrics value need to be long, double or int
metrics-name: service_sla
op: "<"
threshold: 8000
# The length of time to evaluate the metrics
period: 10
# How many times after the metrics match the condition, will trigger alarm
count: 2
# How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
silence-period: 3
message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
service_resp_time_percentile_rule:
# Metrics value need to be long, double or int
metrics-name: service_percentile
op: ">"
threshold: 1000,1000,1000,1000,1000
period: 10
count: 3
silence-period: 5
message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
service_instance_resp_time_rule:
metrics-name: service_instance_resp_time
op: ">"
threshold: 1000
period: 10
count: 2
silence-period: 5
message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
database_access_resp_time_rule:
metrics-name: database_access_resp_time
threshold: 1000
op: ">"
period: 10
count: 2
message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
endpoint_relation_resp_time_rule:
metrics-name: endpoint_relation_resp_time
threshold: 1000
op: ">"
period: 10
count: 2
message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
# Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
# Because the number of endpoint is much more than service and instance.
#
# endpoint_avg_rule:
# metrics-name: endpoint_avg
# op: ">"
# threshold: 1000
# period: 10
# count: 2
# silence-period: 5
# message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
# - http://127.0.0.1