SLS Machine Learning Best Practices: Log Clustering + Anomaly Alerting

1. What hammers do we have on hand?

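Extracting more value from logs has always been our team's focus. Building on the existing real-time log query, this year SLS rounded out the following features for DevOps:

Context query
Real-time Tail and intelligent clustering, to speed up troubleshooting
A range of anomaly detection and forecasting functions for time-series data, for smarter checks and predictions
Visualization of data analysis results
Powerful alert configuration and notification, with follow-up actions triggered by calling a webhook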

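Today we focus on how intelligent log clustering and anomaly alerting work together to detect anomalies and raise alerts more effectively.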

2. Platform experiment

2.1 Experimental data

We use a raw Sys Log dataset with the log clustering service enabled. Its status is shown in the screenshot below:

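Adjusting the size of red box 1 in the screenshot below changes the result in red box 2, but the finest-grained patterns do not change. In other words, the result for each sub-pattern is stable and unique, so we can use a sub-pattern's signature to find the corresponding raw log entries.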

2.2 Generating a time series for a sub-pattern

Suppose we want to monitor this sub-pattern:

msg:vm-111932.tc su: pam_unix(*:session): session closed for user root
Corresponding signature_id: __log_signature__: 1814836459146662485

With the raw logs for this pattern in hand, we can look at a histogram of their count along the time axis:

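The histogram shows that logs of this pattern are unevenly distributed, and some time windows contain no logs at all. Counting the logs directly per time window gives the following time series: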

__log_signature__: 1814836459146662485 |  
select 
    date_trunc('minute', __time__) as time, 
    COUNT(*) as num 
from log GROUP BY time order by time ASC limit 10000

The resulting series is not continuous in time, so we need to fill in its missing points.

__log_signature__: 1814836459146662485 | 
select 
    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
    avg(num) as num 
from  ( 
    select 
        __time__ - __time__ % 60 as time, 
        COUNT(*) as num 
    from log GROUP BY time order by time desc ) 
GROUP by time order by time ASC limit 10000
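
To make the fill-in step concrete, here is a minimal Python sketch of the same idea outside SLS: bucket event times by minute (mirroring `__time__ - __time__ % 60`) and emit a zero for every minute with no logs. The function names and sample timestamps are hypothetical, and the sketch only fills gaps between the first and last observed bucket, which may differ from how `time_series` treats the query window.

```python
from collections import Counter


def per_minute_counts(timestamps):
    """Count events per minute bucket, like `__time__ - __time__ % 60`."""
    return Counter(ts - ts % 60 for ts in timestamps)


def fill_gaps(counts, step=60, fill=0):
    """Return (bucket, count) pairs for every step between the first and last bucket."""
    if not counts:
        return []
    start, end = min(counts), max(counts)
    return [(t, counts.get(t, fill)) for t in range(start, end + 1, step)]


# Hypothetical event times (Unix seconds); nothing was logged in the minute starting at 60.
events = [0, 10, 59, 125, 130]
series = fill_gaps(per_minute_counts(events))
print(series)  # [(0, 3), (60, 0), (120, 2)]
```

Without the fill step the series would jump straight from minute 0 to minute 120, and a detector would never see the empty minute.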

2.3 Anomaly detection on the time series

Use the time-series anomaly detection function ts_predicate_arma:

__log_signature__: 1814836459146662485 | 
select 
    ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') 
from  ( 
    select 
        time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
        avg(num) as num 
    from  ( 
        select 
            __time__ - __time__ % 60 as time, 
            COUNT(*) as num 
        from log GROUP BY time order by time desc ) 
    GROUP by time order by time ASC ) limit 10000
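
For readers who want a feel for what such a detector produces, the sketch below is a deliberately simplified stand-in, not the ARMA model behind `ts_predicate_arma`: it predicts each point as the mean of the previous few values and flags it when it falls outside a band of k standard deviations. The row layout `[unixtime, src, pred, upper, lower, prob]` follows the columns unpacked later in this article; the `window` and `k` parameters and the sample data are assumptions for illustration.

```python
from statistics import mean, pstdev


def detect(series, window=5, k=3.0):
    """series: list of (unixtime, value).

    Returns rows [unixtime, src, pred, upper, lower, prob], where prob is
    1.0 for a point outside the band and 0.0 otherwise.
    """
    rows = []
    for i, (t, value) in enumerate(series):
        history = [v for _, v in series[max(0, i - window):i]]
        if len(history) < window:
            # Not enough history yet: echo the point and never flag it.
            rows.append([t, value, value, value, value, 0.0])
            continue
        pred = mean(history)
        band = k * pstdev(history)
        upper, lower = pred + band, pred - band
        prob = 1.0 if value > upper or value < lower else 0.0
        rows.append([t, value, pred, upper, lower, prob])
    return rows


# Hypothetical per-minute counts with a spike in the last minute.
series = [(60 * i, v) for i, v in enumerate([2, 3, 2, 3, 2, 3, 20])]
rows = detect(series)
print(rows[-1][5])  # 1.0
```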

2.4 How to set up alerts

Break the result of the machine learning function into separate columns:

__log_signature__: 1814836459146662485 | 
select 
    t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
from  ( 
    select 
        ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
    from  ( 
        select 
            time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
            avg(num) as num 
        from  ( 
            select 
                __time__ - __time__ % 60 as time, 
                COUNT(*) as num 
            from log GROUP BY time order by time desc ) 
        GROUP by time order by time ASC )) , unnest(res) as t(t1)
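
The `unnest` step above turns one array of six-element rows into one record per row with named columns. A rough Python equivalent, with hypothetical sample values and the column order taken from the query, looks like this:

```python
from typing import NamedTuple


class Point(NamedTuple):
    unixtime: float
    src: float
    pred: float
    up: float
    lower: float
    prob: float


def unpack(res):
    """Turn a list of 6-element rows into named points, one per row."""
    return [Point(*row) for row in res]


# Hypothetical result rows in the column order used by the query above.
res = [
    [1545000000, 3.0, 2.8, 4.1, 1.5, 0.0],
    [1545000060, 9.0, 3.0, 4.3, 1.7, 1.0],
]
points = unpack(res)
print(points[1].src, points[1].up)  # 9.0 4.3
```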

Alert on the results of the most recent two minutes:

__log_signature__: 1814836459146662485 | 
select 
    unixtime, src, pred, up, lower, prob 
from  ( 
    select 
        t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
    from  ( 
        select 
            ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
        from  ( 
            select 
                time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
                avg(num) as num 
            from  ( 
                select 
                    __time__ - __time__ % 60 as time, COUNT(*) as num 
                from log GROUP BY time order by time desc ) 
            GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
    where is_nan(src) = false order by unixtime desc limit 2
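
The filter and ordering in this query can be sketched in Python as follows. The assumption here is that rows with a NaN `src` are points with no observed value (for example, forecast points), which is why they are dropped before taking the two newest rows; the sample rows are made up.

```python
import math


def latest_points(rows, n=2):
    """rows: [unixtime, src, pred, up, lower, prob].

    Keep rows whose src is not NaN and return the n newest, newest first.
    """
    observed = [r for r in rows if not math.isnan(r[1])]
    return sorted(observed, key=lambda r: r[0], reverse=True)[:n]


nan = float("nan")
rows = [
    [0, 3.0, 3.0, 4.0, 2.0, 0.0],
    [60, 2.0, 3.0, 4.0, 2.0, 0.0],
    [120, 9.0, 3.0, 4.0, 2.0, 1.0],
    [180, nan, 3.0, 4.0, 2.0, 0.0],  # assumed forecast row: no observed value
]
recent = latest_points(rows)
print([r[0] for r in recent])  # [120, 60]
```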

Alert on rising points, and set a fallback strategy:

__log_signature__: 1814836459146662485 | 
select 
    sum(prob) as sumProb, max(src) as srcMax, max(up) as upMax 
from ( 
    select 
        unixtime, src, pred, up, lower, prob 
    from  ( 
        select 
            t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
        from  ( 
            select 
                ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
            from  ( 
                select 
                    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, avg(num) as num 
                from  ( 
                    select 
                        __time__ - __time__ % 60 as time, COUNT(*) as num 
                    from log GROUP BY time order by time desc ) 
                GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
        where is_nan(src) = false order by unixtime desc limit 2 )
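
The query returns three aggregates, sumProb, srcMax and upMax, but the trigger condition itself is not spelled out here. One plausible way to combine them, shown purely as an illustration, is to alert when an anomaly was flagged and the observed value rose above the upper bound, and to keep a fixed threshold as the fallback. The `hard_limit` value and both function names are hypothetical.

```python
def aggregate(rows):
    """rows: [unixtime, src, pred, up, lower, prob] -> (sumProb, srcMax, upMax)."""
    return (
        sum(r[5] for r in rows),
        max(r[1] for r in rows),
        max(r[3] for r in rows),
    )


def should_alert(sum_prob, src_max, up_max, hard_limit=100.0):
    """Hypothetical trigger; hard_limit is an illustrative fallback threshold."""
    rising_anomaly = sum_prob > 0 and src_max > up_max
    fallback = src_max >= hard_limit
    return rising_anomaly or fallback


# Hypothetical rows for the two most recent minutes.
spike = [[120, 9.0, 3.0, 4.0, 2.0, 1.0], [60, 2.0, 3.0, 4.0, 2.0, 0.0]]
drop = [[120, 0.0, 3.0, 4.0, 2.0, 1.0], [60, 3.0, 3.0, 4.0, 2.0, 0.0]]
flood = [[120, 150.0, 140.0, 160.0, 120.0, 0.0], [60, 145.0, 140.0, 160.0, 120.0, 0.0]]

print(should_alert(*aggregate(spike)))  # True: flagged and above the upper bound
print(should_alert(*aggregate(drop)))   # False: flagged, but a drop rather than a rise
print(should_alert(*aggregate(flood)))  # True: not flagged, caught by the fallback
```

The fallback covers the case where the volume has been high for long enough that the detector treats it as normal.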