Collecting error logs with mtail and alerting through Prometheus
mtail configuration
cat /etc/mtail/error.mtail
counter error_log by file,date,info
/\[(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[error\],(?P<info>.*)/ {
  error_log[getfilename()][$date][$info]++
}
systemd service unit file
cat /usr/lib/systemd/system/mtail.service
[Unit]
Description=mtail server
After=network.target

[Service]
ExecStart=/usr/local/bin/mtail --progs /etc/mtail --logs /data/server/logs/serverstatus_*.log
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
Sample error log
[2088-11-09 23:25:31][error],[object Promise] reason:TypeError: Cannot read properties of undefined (reading 'area')
    at /data/server/server-2022-11-02-19-49-41-627-ver-07b1a29930153101e4feb0ff39e760903d9e5cbe/Project/Servers/wbScene/worldScene/EntityComponent/ComponentTrade.js:139:64
    at Array.forEach (<anonymous>)
    at ComponentTrade.autoTradeAction (/data/server/server-2022-11-02-19-49-41-627-ver-07b1a29930153101e4feb0ff39e760903d9e5cbe/Project/Servers/wbScene/worldScene/EntityComponent/ComponentTrade.js:133:16)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
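The pattern in error.mtail can be sanity-checked against the sample entry before deploying it. The sketch below uses Python's re module, which accepts the same (?P<name>...) named-group syntax; it is only an approximation of how mtail matches, and the helper name parse_error_line is made up for illustration.

```python
import re

# Same pattern as in error.mtail, copied verbatim.
PATTERN = re.compile(
    r"\[(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[error\],(?P<info>.*)"
)

# Start of the sample entry above, with the stack frames left out.
SAMPLE = (
    "[2088-11-09 23:25:31][error],[object Promise] reason:TypeError: "
    "Cannot read properties of undefined (reading 'area')"
)


def parse_error_line(line):
    """Return (date, info) if the line is an error entry, otherwise None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("date"), match.group("info")


if __name__ == "__main__":
    print(parse_error_line(SAMPLE))
    # Stack-frame lines carry no "[date][error]," prefix, so they do not match.
    print(parse_error_line("    at Array.forEach (<anonymous>)"))
```

Only the line carrying the [date][error] prefix produces a match, so in this model each stack trace contributes one sample rather than one per frame.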
Revised version
mtail configuration
gauge error_log_timestamp by file,info
/\[(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[error\]\D(?P<info>.*)/ {
  error_log_timestamp[getfilename()][$info] = timestamp()
}
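One way to read the difference between the two mtail programs is in the shape of the series they export. The post does not spell this out, so the following is an interpretation rather than the author's stated reasoning: with date as a label, every occurrence of an error becomes a new series with value 1, whereas the gauge keyed by file and info is overwritten on each occurrence. The Python sketch models both with plain dictionaries; the events and Unix times are hypothetical.

```python
from collections import defaultdict


def counter_series(events):
    """Model `counter error_log by file,date,info`: one series per distinct date."""
    series = defaultdict(int)
    for file, date, info, _seen_at in events:
        series[(file, date, info)] += 1
    return dict(series)


def gauge_series(events):
    """Model `gauge error_log_timestamp by file,info`: the last write wins."""
    series = {}
    for file, _date, info, seen_at in events:
        series[(file, info)] = seen_at
    return series


# Hypothetical events: the same error logged twice, one second apart.
# Fields: file, date label, info label, Unix time at which the line was seen.
EVENTS = [
    ("serverstatus_1.log", "2022-11-09 23:25:31", "TypeError: example", 1668007531),
    ("serverstatus_1.log", "2022-11-09 23:25:32", "TypeError: example", 1668007532),
]

if __name__ == "__main__":
    print(len(counter_series(EVENTS)))  # two series, each with value 1
    print(gauge_series(EVENTS))         # one series holding the later time
```

In this model the counter's series never grow past 1, so there is nothing for a rate-style query to observe, while the gauge always carries the time of the latest occurrence.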
Prometheus alerting rule
groups:
- name: ErrorLog
  rules:
  - alert: ErrorLog # alertname
    expr: time() - error_log_timestamp <= 60
    for: 1s
    labels:
      severity: critical
    annotations:
      title: "[Error log alert]"
      info: "Error in file {{ $labels.file }}; error message: {{ $labels.info }}. Log in to the server for details!"
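Query statement
Notes:
I struggled with this problem for a long time. Prometheus's built-in functions could not solve it, so in the end I changed my approach and tackled it from the mtail configuration instead.
The idea is this: every time an error line is matched, mtail records a timestamp for that error. The alert expression then subtracts that timestamp from the current time. If the difference is no more than 60 seconds, the error occurred within the last minute, so the alert always reflects the most recent error.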
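The alert condition can be illustrated with a small Python sketch. It mimics the filtering behaviour of the expression time() - error_log_timestamp <= 60, which keeps only the series that satisfy the comparison. The function name recent_errors and the sample values are hypothetical.

```python
def recent_errors(last_seen, now, window_seconds=60):
    """Keep the series whose age (now - timestamp) is within the window."""
    ages = {labels: now - ts for labels, ts in last_seen.items()}
    return {labels: age for labels, age in ages.items() if age <= window_seconds}


# Hypothetical gauge contents: (file, info) -> Unix time of the last occurrence.
LAST_SEEN = {
    ("serverstatus_1.log", "TypeError: example"): 1_000_000,
    ("serverstatus_2.log", "RangeError: example"): 999_900,
}

if __name__ == "__main__":
    # 30 seconds after the first error: only that series is still in the window.
    print(recent_errors(LAST_SEEN, now=1_000_030))
```

Once an error has not recurred for more than 60 seconds, its series drops out of the result, which in this model corresponds to the alert resolving by itself.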