FELK Study Notes (Common ElastAlert Rule Types)

This time we take a closer look at ElastAlert's configuration and the rule types it supports. They cover most everyday business requirements.

Global configuration

es_host: 127.0.0.1
es_port: 9200
rules_folder: rules
run_every:
  minutes: 5
buffer_time:
  minutes: 5

# Optional basic-auth username and password for elasticsearch
#es_username: someusername
#es_password: somepassword
writeback_index: demo.elastalert_status

alert_time_limit:
  hours: 6

skip_invalid: True
scan_subdirectories: False

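The global configuration file is fairly self-explanatory. The most important settings are:

run_every: how often ElastAlert queries Elasticsearch

buffer_time: the time window covered by each query

writeback_index: the index in which ElastAlert stores all of its own data; see the elastalert_status documentation in the references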

old_query_limit: The maximum time between queries for ElastAlert to start at the most recently run query.
When ElastAlert starts, for each rule, it will search elastalert_metadata for the most recently run query and start from that time, unless it is older than old_query_limit, in which case it will start from the present time. The default is one week.

Note that changes to this configuration file only take effect after ElastAlert is restarted.
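
The old_query_limit behaviour quoted above can be sketched in a few lines of Python. This is only an illustration of the described logic, not ElastAlert's own code; the function name choose_start_time and its arguments are made up for the example.

```python
from datetime import datetime, timedelta, timezone

def choose_start_time(last_run_end, now, old_query_limit=timedelta(weeks=1)):
    """Return the time a rule should resume querying from (illustrative only)."""
    # No previous run recorded, or the last run is too old: start from the present.
    if last_run_end is None or now - last_run_end > old_query_limit:
        return now
    # Otherwise continue where the previous run stopped.
    return last_run_end

now = datetime(2020, 6, 1, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(hours=3)
stale = now - timedelta(days=10)
print(choose_start_time(recent, now))  # resumes from the recent run
print(choose_start_time(stale, now))   # falls back to the present time
```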

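Rule types

The commonly used rule types are:

frequency

Description: this rule matches when at least a certain number of events occur within a given time range. Events can be counted separately for each query_key.

Rule: raise an alert if the keyword is matched 10 times within 5 minutes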

es_host: 127.0.0.1
es_port: 9200
name: index demo find invalid keyword # rule names must be unique
type: frequency
num_events: 10 # number of events required for a match
# type: flatline
# threshold: 10  # minimum threshold
index: demo.demo-*
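match_enhancements: # enable enhancements
  - 'elastalert.enhancements.TimeEnhancement'
timeframe: # time window
  minutes: 5
query_key: # field(s) used to deduplicate alerts (several fields are combined into one compound key for querying)
- name
realert:
  minutes: 5 # within this period only one alert is sent per query_key value; further alerts are simply dropped
#exponential_realert:
#  minutes: 5 # must be larger than realert; between realert and exponential_realert, the realert period doubles after each alert
filter: # filters use the Elasticsearch query DSL
  - query:
      wildcard:
        backendMod: '*senseface*'
# The rule above: alert if 10 matching log entries appear within 5 minutes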
alert:
  - 'email'
alert_subject: 'ELASTLOG: FIND EVENT IN INDEX [{}] FROM [{}] TO [{}]' # email subject; the placeholders are filled by the 3 arguments below
alert_subject_args:
  - _index
  - starttime
  - endtime
alert_text_args:  # email body arguments; here they are passed straight to an email template (custom development on top of ElastAlert)
  - '@timestamp'
  - '@version'
  - _id
  - _index
  - _type
  - ... # 省略
  - num_hits
  - num_matches
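  - starttime  # starttime is available via self.rule.get('starttime'); it is not part of match_body
               # if starttime/endtime are not passed in, endtime is the current time and starttime is computed from it
  - endtime  # endtime is the current time when the query runs, so it is not stored; using it requires modifying the source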

smtp_host: demo.mail.com
smtp_port: 587
smtp_auth_file: /opt/elastalert/auth/smtp_auth_file.yaml
email_reply_to: demo@top.com
from_addr: demo@top.com

email_format: html
email:
  - 'no-reply@top.com'

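alert_text_type: alert_text_only # do not send the default content; without this line the original content is sent along with the email body above
use_local_time: true
# include: ["ip_address", "hostname", "status"] # restrict the fields returned by the query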

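For the remaining rule types, only the settings that differ from the configuration above are shown. The official documentation describes them in detail.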
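
To make the frequency rule's counting more concrete, here is a rough Python sketch. It assumes that each query_key value gets its own sliding window and that the window is reset once it produces a match; the function name and event layout are invented for the example and are not ElastAlert's actual implementation.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def count_frequency_matches(events, num_events, timeframe, query_key):
    """Count matches for a frequency-style rule (illustrative, not ElastAlert's code).

    events: dicts with a 'ts' datetime plus the query_key field, sorted by time.
    """
    windows = defaultdict(deque)  # one sliding window per query_key value
    matches = 0
    for event in events:
        window = windows[event[query_key]]
        window.append(event["ts"])
        # Drop timestamps that fell out of the timeframe.
        while event["ts"] - window[0] > timeframe:
            window.popleft()
        if len(window) >= num_events:
            matches += 1
            window.clear()  # assumption: counting starts again after a match
    return matches

start = datetime(2020, 6, 1, 13, 54)
# 25 events for one key, 10 seconds apart, all inside a 5-minute timeframe
events = [{"ts": start + timedelta(seconds=10 * i), "name": "svc-a"} for i in range(25)]
print(count_frequency_matches(events, 10, timedelta(minutes=5), "name"))  # 2
```

With 25 hits and num_events set to 10, the sketch reports 2 matches; the 5 leftover events are not enough for a third.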

any

Description: every hit matches; each hit returned by the query generates an alert.

Rule: raise an alert when the status field matches anystatus

type: any
timeframe:
    minutes: 1

Use this rule type with great care, because every single hit generates an alert.

flatline

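Description: this rule matches when the total number of events in a time period falls below a given threshold.

Rule: raise an alert if fewer than 3 matching documents appear within 1 minute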

timeframe:
    minutes: 1
threshold: 3
type: flatline
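
The flatline check boils down to counting events in the window and comparing the count with the threshold. A minimal sketch, with a made-up helper name and sample data:

```python
from datetime import datetime, timedelta

def is_flatline(timestamps, window_end, timeframe, threshold):
    """True when fewer than `threshold` events fall in (window_end - timeframe, window_end]."""
    window_start = window_end - timeframe
    count = sum(1 for ts in timestamps if window_start < ts <= window_end)
    return count < threshold

end = datetime(2020, 6, 1, 14, 0)
seen = [end - timedelta(seconds=50), end - timedelta(seconds=20)]
print(is_flatline(seen, end, timedelta(minutes=1), threshold=3))  # True: only 2 events
```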

spike

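Description: this rule matches when the volume of events in a time period is spike_height times larger or smaller than in the preceding period. It uses two sliding windows, called "reference" and "current", to compare the current event rate against the reference rate.

Rule: raise an alert when the current window holds at least 3 events and at least 1 times as many events as the reference window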

type: spike
timeframe:
    minutes: 1
threshold_cur: 3
spike_height: 1
spike_type: "up"

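threshold_cur: the minimum number of events the current window must contain before an alert can fire

spike_height: the ratio between the event counts of the current and reference windows that triggers an alert

spike_type: up, down, or both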
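
The comparison between the two windows can be sketched as below. The sketch assumes, following the official documentation, that spike_height is a ratio between the two windows and threshold_cur is a minimum event count for the current window; the function is_spike and its exact comparison operators are illustrative only.

```python
def is_spike(ref_count, cur_count, spike_height, spike_type="up", threshold_cur=0):
    """Compare the current window against the reference window (illustrative only)."""
    if cur_count < threshold_cur:
        return False  # not enough events in the current window to consider
    up = cur_count >= ref_count * spike_height
    down = cur_count * spike_height <= ref_count
    if spike_type == "up":
        return up
    if spike_type == "down":
        return down
    return up or down  # "both"

print(is_spike(ref_count=10, cur_count=35, spike_height=3))                    # True
print(is_spike(ref_count=10, cur_count=35, spike_height=3, threshold_cur=60))  # False
print(is_spike(ref_count=30, cur_count=5, spike_height=3, spike_type="down"))  # True
```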

change

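Description: this rule monitors a field and matches when its value changes. The change is measured against the last event with the same query_key.

Rule: raise an alert when two events share the same server value but have different codec values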

type: change
timeframe:
    minutes: 1
compare_key: codec
ignore_null: true
query_key: server

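compare_key: the field compared against the previous record

query_key: the field that must be identical to the previous record

ignore_null: ignore records that do not contain the compare_key field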
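
How these three options interact is perhaps easiest to see in a small sketch. The generator below is a simplified stand-in for the rule, not ElastAlert's code, and the sample events are made up:

```python
def find_changes(events, query_key, compare_key, ignore_null=True):
    """Yield (key, old, new) whenever compare_key changes for the same query_key value."""
    last_seen = {}
    for event in events:
        key = event[query_key]
        value = event.get(compare_key)
        if value is None and ignore_null:
            continue  # skip events that lack the compared field
        if key in last_seen and last_seen[key] != value:
            yield key, last_seen[key], value
        last_seen[key] = value

events = [
    {"server": "s1", "codec": "h264"},
    {"server": "s2", "codec": "h264"},
    {"server": "s1"},                   # no codec: ignored
    {"server": "s1", "codec": "h265"},  # s1 changed
]
print(list(find_changes(events, "server", "codec")))  # [('s1', 'h264', 'h265')]
```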

blacklist

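Description: the blacklist rule checks a field against a blacklist and matches if the value is on the list.

Rule: raise an alert when the backendMod field matches the keyword sensefacexxx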

type: blacklist
timeframe:
    minutes: 1
compare_key: backendMod
blacklist:
    - "sensefacexxx"

Note that when the rule is turned into an Elasticsearch query, the blacklist values are added to the query conditions as well, as shown below:

curl -H 'Content-Type: application/json' -XGET 'http://localhost:9200/demo.demo-*/_search?pretty&_source_include=%40timestamp%2C%2A%2Cschema.device_id%2CbackendMod&ignore_unavailable=true&scroll=30s&size=1000' -d '{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gt": "2020-06-01T13:54:33.079080Z",
                  "lte": "2020-06-01T13:59:33.079080Z"
                }
              }
            },
            {
              "wildcard": {
                "backendMod": "*senseface*"
              }
            },
            {
              "query_string": {
                "query": "backendMod:\"sensefacexxx\""
              }
            }
          ]
        }
      }
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "asc"
      }
    }
  ]
}'

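whitelist

Description: similar to the blacklist, this rule compares a field against a whitelist and matches if the value is not on the list.

Neither blacklist nor whitelist has a rule title.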
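
The two list-based rules are mirror images of each other, which a short sketch makes plain. The helper list_match is hypothetical and only models the membership test, not the Elasticsearch query that ElastAlert builds:

```python
def list_match(event, compare_key, values, mode="blacklist"):
    """Return True when the event should raise an alert (illustrative only)."""
    value = event.get(compare_key)
    if mode == "blacklist":
        return value in values       # alert when the value IS listed
    return value not in values       # whitelist: alert when it is NOT listed

event = {"backendMod": "sensefacexxx"}
print(list_match(event, "backendMod", {"sensefacexxx"}))                    # True
print(list_match(event, "backendMod", {"sensefaceyyy"}, mode="whitelist"))  # True
```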

cardinality

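Description: this rule matches when the number of unique values of a given field within a time range is above or below a threshold.

Rule: raise an alert when level has more than 2 unique values (strictly more than 2) within 1 minute.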

type: cardinality
timeframe:
    minutes: 1
cardinality_field: level
max_cardinality: 2
query_key:
- schema.device_id # the query_key is kept in the result
# min_cardinality: 1  # if both max and min are set, they are combined with OR

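Note: because a cardinality rule queries unique values, it does not return the match_body content, so any field that references match_body comes back as <MISSING VALUE>. An enhancement can be used to work around this.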
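
As a rough model of the cardinality check, the sketch below counts unique values per query_key for events that are assumed to lie within one timeframe already. The function name, the short field name device, and the sample data are invented for illustration.

```python
from collections import defaultdict

def cardinality_matches(events, query_key, cardinality_field,
                        max_cardinality=None, min_cardinality=None):
    """Return query_key values whose unique-value count breaks a bound."""
    seen = defaultdict(set)
    for event in events:
        seen[event[query_key]].add(event[cardinality_field])
    hits = {}
    for key, values in seen.items():
        too_many = max_cardinality is not None and len(values) > max_cardinality
        too_few = min_cardinality is not None and len(values) < min_cardinality
        if too_many or too_few:  # the two bounds are OR-ed
            hits[key] = len(values)
    return hits

events = [
    {"device": "d1", "level": "info"},
    {"device": "d1", "level": "warn"},
    {"device": "d1", "level": "error"},
    {"device": "d2", "level": "info"},
    {"device": "d2", "level": "info"},
]
print(cardinality_matches(events, "device", "level", max_cardinality=2))  # {'d1': 3}
```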

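About time formats

First of all: by default, the @timestamp of logs stored in Elasticsearch is in UTC, so every query time window is converted to UTC.

ElastAlert also helpfully formats dates for output according to the local time zone.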

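# endtime is always the current time: the UTC time is fetched first and converted to local time only for display
datetime.datetime.utcnow().replace(tzinfo=dateutil.tz.tzutc()) # 2020-05-30 13:41:33.359737+00:00
# starttime is derived from endtime and buffer_time
# All formatted output goes through this helper, which converts to local time
# (datetime, dateutil and ts_to_dt are imported/defined alongside pretty_ts in elastalert/util.py)
def pretty_ts(timestamp, tz=True): # tz comes from use_local_time in the rule file; it defaults to true, so it can be omitted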
    """Pretty-format the given timestamp (to be printed or logged hereafter).
    If tz, the timestamp will be converted to local time.
    Format: YYYY-MM-DD HH:MM TZ"""
    dt = timestamp
    if not isinstance(timestamp, datetime.datetime):
        dt = ts_to_dt(timestamp)
    if tz:
        dt = dt.astimezone(dateutil.tz.tzlocal())
    return dt.strftime('%Y-%m-%d %H:%M %Z') # 2020-05-30 21:19 CST

In most cases queries use @timestamp as the time field. To use a different field, set the following options in the rule file:

timestamp_field: datetime
timestamp_type: custom
timestamp_format: "%Y-%m-%dT%H:%M:%S.%fZ"
timestamp_format_expr: "ts[:23] + ts[26:]"
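
What the timestamp_format_expr above does is easier to see when it is evaluated by hand. The sketch assumes, as the official documentation describes, that the expression is evaluated as Python with the formatted string available as ts and the datetime as dt; the sample datetime is arbitrary.

```python
from datetime import datetime

timestamp_format = "%Y-%m-%dT%H:%M:%S.%fZ"
timestamp_format_expr = "ts[:23] + ts[26:]"

dt = datetime(2020, 5, 30, 13, 41, 33, 359737)
ts = dt.strftime(timestamp_format)  # %f always yields 6 digits (microseconds)
# Evaluate the expression with `ts` and `dt` in scope, mimicking the described behaviour.
refined = eval(timestamp_format_expr, {"ts": ts, "dt": dt})

print(ts)       # 2020-05-30T13:41:33.359737Z
print(refined)  # 2020-05-30T13:41:33.359Z
```

In other words, the expression keeps the first three fractional digits and the trailing Z, turning microseconds into milliseconds.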

Reading the logs

Queried rule index realty find invalid keyword from 2020-05-29 08:49 UTC to 2020-05-29 08:54 UTC: 482 / 482 hits
INFO:elastalert:Ignoring match for silenced rule index realty find invalid keyword.d0:94:66:6a:6f:2b
INFO:elastalert:Ignoring match for silenced rule index realty find invalid keyword.d0:94:66:6a:6f:2b
...
INFO:elastalert:Ran index realty find invalid keyword from 2020-05-29 08:49 UTC to 2020-05-29 08:54 UTC: 482 query hits (108 already seen), 37 matches, 1 alerts sent
INFO:elastalert:Reloading configuration for rule rules/demo.yaml

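In the ElastAlert index, hits is the number of documents the rule's query returned; matches is the number of times those hits satisfied the rule and triggered an alert.
num_hits is the number of records Elasticsearch returned for the filter and the queried time range, while num_matches is the number of alerts expected to be generated.
So for a frequency rule, roughly num_matches = num_hits / num_events, rounded to a whole number, which is why the two values in the alert content are usually related this way.

The log lines above carry several very useful pieces of information.

Time range: 2020-05-29 08:49 to 2020-05-29 08:54

Documents queried (records matching the query): 482

Documents already processed (already seen): 108

Matches: 37

Alerts sent: 1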

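If the run interval (run_every) is shorter than the query window (buffer_time), consecutive windows overlap and already seen appears in the log. For example, with run_every at 3 minutes and buffer_time at 5 minutes, ElastAlert queries the last 5 minutes every 3 minutes; the first 2 minutes of each window were already covered by the previous run, so already seen means, roughly, that those 2 minutes contained 108 documents.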
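
The 2-minute overlap in that example can be checked with a little date arithmetic. The helpers below are invented for the illustration and simply model each run as querying the buffer_time before its own start:

```python
from datetime import datetime, timedelta

def query_windows(first_run, run_every, buffer_time, runs):
    """Return (start, end) of the window queried on each run (illustrative only)."""
    return [(first_run + i * run_every - buffer_time, first_run + i * run_every)
            for i in range(runs)]

def overlap(prev, cur):
    """Length of time covered by both windows (zero if they do not overlap)."""
    return max(min(prev[1], cur[1]) - max(prev[0], cur[0]), timedelta(0))

windows = query_windows(datetime(2020, 5, 29, 8, 54),
                        run_every=timedelta(minutes=3),
                        buffer_time=timedelta(minutes=5), runs=2)
print(overlap(windows[0], windows[1]))  # 0:02:00
```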

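If realert is set in the rule file, say to 2 minutes, the log may show Ignoring match lines. These are dropped alerts: realert means that alerts from the same kind of match are not sent again within 2 minutes and are simply discarded.

With realert set to 0, every match produces an alert, so watch out for alert storms.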
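
The effect of realert can be modelled as a per-key silence period. The sketch below is a simplification written for this article (send_with_realert and the shortened key values are hypothetical), but it reproduces the behaviour described above, including the realert=0 case.

```python
from datetime import datetime, timedelta

def send_with_realert(matches, realert):
    """Split matches into sent and silenced lists (illustrative only).

    matches: (timestamp, query_key_value) tuples sorted by time.
    """
    silenced_until = {}
    sent, ignored = [], []
    for ts, key in matches:
        if key in silenced_until and ts < silenced_until[key]:
            ignored.append((ts, key))  # would be logged as "Ignoring match for silenced rule"
            continue
        sent.append((ts, key))
        silenced_until[key] = ts + realert
    return sent, ignored

t0 = datetime(2020, 5, 29, 8, 50)
matches = [(t0, "d0:94"), (t0 + timedelta(seconds=30), "d0:94"),
           (t0 + timedelta(seconds=40), "aa:bb"), (t0 + timedelta(minutes=3), "d0:94")]
sent, ignored = send_with_realert(matches, realert=timedelta(minutes=2))
print(len(sent), len(ignored))  # 3 1
```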

match_body

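The match_body dictionary is the most important structure. When a document that qualifies for an alert is found, a copy of it is written to the ElastAlert index, with content like this: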

{
  "_index": "demo.elastalert_status",
  "_type": "elastalert",
  "_id": "AXJfrrKDPz_v1Z2rKLI3",
  "_score": null,
  "_source": {
    "match_body": {  # copied from the original log entry
      # fields from the original log entry
      "msg": "\"unaryInterceptor done\"",
      # ...
    },
    "rule_name": "index realty find invalid keyword",
    "alert_info": {
      "type": "email",
      "recipients": [
        "demo@top.com"
      ]
    },
    "alert_sent": true,
    "alert_time": "2020-05-29T09:06:22.194284Z",
    "match_time": "2020-05-29T09:02:54.540Z",
    "@timestamp": "2020-05-29T09:06:23.997639Z"
  }
}

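The arguments listed under alert_text_args and alert_subject_args in the rule file above can refer directly to fields of the match_body structure. Nested fields are referenced with a dot (.).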
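
A dotted lookup of this kind can be sketched as follows. The helper lookup, the subject template, and the sample match are all made up for the example; ElastAlert has its own lookup routine, which this does not claim to reproduce.

```python
def lookup(match, dotted_key, default="<MISSING VALUE>"):
    """Resolve 'a.b.c' against a nested dict, trying the literal key first."""
    if dotted_key in match:
        return match[dotted_key]
    current = match
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current

match = {"_index": "demo.demo-2020.05.29", "schema": {"device_id": "d0:94:66:6a:6f:2b"}}
subject = "ELASTLOG: FIND EVENT IN INDEX [{}] FOR DEVICE [{}] LEVEL [{}]"
args = ["_index", "schema.device_id", "level"]
print(subject.format(*(lookup(match, a) for a in args)))
```

The missing level field falls back to the placeholder string rather than raising an error.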

num_hits vs num_matches

num_hits is the number of documents returned by elasticsearch for a given query. num_matches is how many times that data matched your rule, each one potentially generating an alert.

If it makes a query over a 10 minute range and gets 10 hits, and you have

type: frequency
num_events: 10
timeframe:
  minutes: 10
then you'll get 1 match.

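Summary:
num_hits is the number of records Elasticsearch returned for the filter and the queried time range,
while num_matches is the number of alerts expected to be generated.
So num_matches = num_hits / num_events,
rounded to a whole number.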

enhancements

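If the data finally returned does not meet your needs, or you want to add custom fields, you can modify match_body before it is handed to the alerter. This is what the enhancements feature is for.

For example, to adjust the time format, you can do this: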

# added to elastalert/enhancements.py, where BaseEnhancement is defined
import arrow
class TimeEnhancement(BaseEnhancement):
    def process(self, match):
        tz = 'Asia/Shanghai'
        tf = 'YYYY-MM-DD HH:mm:ss'
        # starttime is in UTC, e.g. 2020-05-29 04:21:17.353831+00:00; local time is 8 hours ahead
        query_start = arrow.get(self.rule.get('starttime')).to(tz).format(tf)
        query_end = arrow.now(tz).format(tf)
        tt = arrow.get(match['@timestamp']).to(tz).format(tf)  # event time in local time (not stored here)
        match['query_start'] = query_start
        match['query_end'] = query_end

Then enable it in the rule file:

match_enhancements: # enable enhancements
  - 'elastalert.enhancements.TimeEnhancement'

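If the error failed to parse [match_time] appears:

Cause: TimeEnhancement in enhancements.py changes the format of @timestamp, so match_time no longer conforms to the format expected by the index.

Fix: modify or remove the following line that enhancements.py contains by default

match['@timestamp'] = pretty_ts(match['@timestamp'])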
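
Going by the cause described above, one way to stay clear of this error is to write the formatted time into a new key and leave @timestamp alone. The sketch below uses only the standard library; the BaseEnhancement stub stands in for the real elastalert.enhancements.BaseEnhancement, and the class name, the timestamp_local key, and the fixed UTC+8 offset are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone

class BaseEnhancement:
    """Stand-in for elastalert.enhancements.BaseEnhancement so the sketch runs alone."""
    def __init__(self, rule):
        self.rule = rule

class LocalTimeEnhancement(BaseEnhancement):
    """Hypothetical enhancement: adds a local-time copy, leaves @timestamp untouched."""
    local_tz = timezone(timedelta(hours=8))  # assumed UTC+8 (e.g. Asia/Shanghai)

    def process(self, match):
        utc = datetime.strptime(match["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        local = utc.replace(tzinfo=timezone.utc).astimezone(self.local_tz)
        match["timestamp_local"] = local.strftime("%Y-%m-%d %H:%M:%S")

match = {"@timestamp": "2020-05-29T09:02:54.540Z"}
LocalTimeEnhancement(rule={}).process(match)
print(match["timestamp_local"])  # 2020-05-29 17:02:54
print(match["@timestamp"])       # 2020-05-29T09:02:54.540Z (unchanged)
```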

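Alert content

The alert content differs somewhat between rule types, but it always consists of a summary plus custom content. For the frequency type it looks like this: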

Ref Log http://192.168.115.65

At least 5 events occurred between 2020-05-26 09:18 UTC and 2020-05-26 09:23 UTC

#... the original log content goes here
# omitted
# ...
num_hits: 318
num_matches: 6

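For the alert content of other rule types, see the official documentation.

Next time I will share a walkthrough of the ElastAlert source code.

References: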

https://www.cnblogs.com/duanxz/p/11859307.html

https://www.freebuf.com/sectool/164591.html

https://www.jianshu.com/p/f82812e0a743

https://github.com/Yelp/elastalert/issues/2754

https://segmentfault.com/a/1190000017553282

https://blog.xizhibei.me/2017/11/19/alerting-with-elastalert/

https://github.com/Yelp/elastalert/issues/1737

https://elastalert.readthedocs.io/en/latest/elastalert_status.html#elastalert-status

https://elastalert.readthedocs.io/en/latest/recipes/adding_enhancements.html#enhancements

gongzb