Collecting error logs with mtail and alerting via Prometheus

mtail configuration

cat /etc/mtail/error.mtail 
counter error_log by file,date,info
/\[(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[error\],(?P<info>.*)/ {
  error_log[getfilename()][$date][$info]++
}

 

systemd service unit file

cat /usr/lib/systemd/system/mtail.service 
[Unit]
Description=mtail server
After=network.target

[Service]
ExecStart=/usr/local/bin/mtail --progs /etc/mtail --logs  /data/server/logs/serverstatus_*.log
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

 

Sample error log

[2088-11-09 23:25:31][error],[object Promise] reason:TypeError: Cannot read properties of undefined (reading 'area')
    at /data/server/server-2022-11-02-19-49-41-627-ver-07b1a29930153101e4feb0ff39e760903d9e5cbe/Project/Servers/wbScene/worldScene/EntityComponent/ComponentTrade.js:139:64
    at Array.forEach (<anonymous>)
    at ComponentTrade.autoTradeAction (/data/server/server-2022-11-02-19-49-41-627-ver-07b1a29930153101e4feb0ff39e760903d9e5cbe/Project/Servers/wbScene/worldScene/EntityComponent/ComponentTrade.js:133:16)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
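
To see what the pattern in error.mtail would capture from a log like this, here is a minimal Python sketch. It assumes Python's re module treats this particular pattern the same way mtail's regex engine does (both accept the (?P<name>...) syntax) and that the match is unanchored. The helper name parse_error_line is hypothetical and exists only for illustration.

```python
import re

# Same pattern as the first version of /etc/mtail/error.mtail.
ERROR_RE = re.compile(
    r"\[(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[error\],(?P<info>.*)"
)

def parse_error_line(line):
    """Return (date, info) if the line is an error header, else None."""
    m = ERROR_RE.search(line)  # assumption: unanchored match, like mtail
    if m is None:
        return None
    return m.group("date"), m.group("info")

sample = [
    "[2088-11-09 23:25:31][error],[object Promise] reason:TypeError: "
    "Cannot read properties of undefined (reading 'area')",
    "    at Array.forEach (<anonymous>)",
]
for line in sample:
    # Only the first line matches; stack-trace lines yield None.
    print(parse_error_line(line))
```

Only the header line of each error produces a match, so the stack-trace lines that follow it never reach the metric. Because both date and info become label values, every distinct error line creates a new time series.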

 

Revised version

mtail configuration

gauge error_log_timestamp by file,info
/\[(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[error\]\D(?P<info>.*)/ {
  error_log_timestamp[getfilename()][$info] = timestamp()
}

 

Prometheus alerting rule

groups:
 - name: ErrorLog
   rules:
    - alert: ErrorLog  # alertname
      expr: time() - error_log_timestamp <= 60
      for: 1s
      labels:
        severity: critical
      annotations:
        title: "[Error log alert]"
        info: "Error file: {{ $labels.file }}, error message: {{ $labels.info }}. Log in to the server for details!"

 

Query

 

 

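Notes:

I struggled with this problem for a long time. Prometheus's built-in functions could not solve it, so in the end I changed my approach and tackled it from the mtail configuration side instead.

The idea: every time an error log line is matched, store a timestamp for it, then subtract that timestamp from the current time. If the difference is at most 60 seconds, the error happened within the last minute, so the alert always carries the most recent error information.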
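
The comparison can be sketched in plain Python to make the logic concrete. This is only an illustration of the arithmetic behind time() - error_log_timestamp <= 60, not how mtail or Prometheus implement it. The names last_seen, record_error and firing and the file name serverstatus_1.log are hypothetical, and the sketch assumes the gauge holds the Unix time at which the line was processed.

```python
import time

WINDOW_SECONDS = 60  # same threshold as the alert expression

# (file, info) -> Unix timestamp of the most recent matching log line,
# mirroring the error_log_timestamp gauge and its two labels.
last_seen = {}

def record_error(file, info, now=None):
    """Overwrite the stored timestamp, as the gauge assignment does."""
    last_seen[(file, info)] = time.time() if now is None else now

def firing(now=None):
    """Return the label pairs whose latest error is at most 60 s old."""
    now = time.time() if now is None else now
    return [key for key, ts in last_seen.items() if now - ts <= WINDOW_SECONDS]

# Hypothetical error seen at t=1000.
record_error("serverstatus_1.log", "TypeError: example", now=1000.0)
print(firing(now=1030.0))  # 30 s later: still inside the window
print(firing(now=1061.0))  # 61 s later: outside the window, empty list
```

Because the stored value is overwritten on every match, a repeated error keeps refreshing its timestamp and stays in the window, while an error that stops recurring drops out after 60 seconds.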

 

 

 

 

  

 

posted @ 2022-11-11 15:28  羊脂玉净瓶