Preface

Link to the original article: here
The focus of this post is Anomaly Detection,
that is, detecting anomalies with different algorithms/methods depending on how the anomaly is defined.
Before looking at the corresponding methods, we first need to know what types of anomalies there are.

Types of Anomalies

See this article.
Anomalies fall into three categories:

  • global outliers / point anomalies
  • contextual outliers / conditional anomalies
  • collective outliers

The original article's explanations:

  • global outlier:
    a data point that suddenly sits far away from the entirety of the data
  • contextual outlier:
    the premise is that the points share the same context; usually the context we mean is a time series, i.e. a periodic context.
    Again this concerns individual data points.
  • collective outlier:
    a collective anomaly appears when several data sources are combined and part of the combined data shows a contextual/global anomaly,
    even though, taken as separate sources, none of the corresponding data is anomalous.
    For example, traffic-jam time series: looked at separately, the G10, G20 and G30 highways each being jammed at 2 p.m. is quite plausible, but if on some day all three are jammed at 2 p.m. at the same time, that is an anomaly.

Summary

A quick summary of the methods the author uses for the different kinds of anomalies.
Most of them fit a model to the data, use the model to score the original data, and take the 1% with the largest deviation (the points the model does not fit well) as anomalies; this shared thresholding pattern is sketched after the list below.

    1. collective anomalies
      • PCA + clustering (K-means)
      • Isolation Forest
      • One-class SVM
    2. contextual data + collective anomalies
      • split the data into different contexts
      • use an elliptic envelope (Gaussian distribution) to fit the data
    3. sequential/contextual anomalies
      • Method 1
        • discretize the target data into different states
        • use Markov chains
      • Method 2
        • RNN
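
As a minimal sketch of that shared thresholding pattern (the scores series below is only a stand-in for whatever per-point deviation score each method produces; it is not code from the original notebook):

# generic "top 1% of deviation scores are anomalies" pattern (illustrative sketch)
import numpy as np
import pandas as pd

outliers_fraction = 0.01
scores = pd.Series(np.random.rand(1000))                 # stand-in deviation scores
number_of_outliers = int(outliers_fraction * len(scores))
threshold = scores.nlargest(number_of_outliers).min()    # smallest score among the top 1%
anomaly = (scores >= threshold).astype(int)              # 1 = anomaly, 0 = normal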

Methods

First, the imported packages:

import numpy as np   # used by the later code blocks
import pandas as pd  # used by the later code blocks

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
#from pyemma import msm # not available on Kaggle Kernel
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

PCA+Cluster

The steps in the original article (a code sketch follows the list):

  1. Pick a few columns
  2. Reduce the dimensionality to 2 components with PCA + MinMaxScaler
  3. Cluster with K-means
  4. Manually define 1% of the data as outliers: outliers_fraction = 0.01
  5. Take the 1% of points with the largest deviation (e.g. distance to their cluster centre) as anomalies
  6. Add an anomaly column to the original data based on the points selected in step 5
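
The original section only lists the steps, so here is a minimal sketch of them. It reuses the imports and the df DataFrame from the notebook; the column selection follows the later code blocks, while the scaling-before-PCA order, the number of clusters (4) and the distance-to-centre anomaly score are my own assumptions, not code taken verbatim from the original notebook.

# minimal sketch of the PCA + K-means steps above (n_clusters and the
# distance-to-centre score are illustrative assumptions)
data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
np_scaled = preprocessing.MinMaxScaler().fit_transform(data)
X = PCA(n_components=2).fit_transform(np_scaled)

kmeans = KMeans(n_clusters=4, random_state=42).fit(X)
# deviation score = distance of each point to its assigned cluster centre
distance = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

outliers_fraction = 0.01
number_of_outliers = int(outliers_fraction * len(distance))
threshold = pd.Series(distance).nlargest(number_of_outliers).min()
df['anomaly_pca_kmeans'] = (distance >= threshold).astype(int)  # 1 = anomaly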

Isolation Forest

Isolation Forest is a recursive random-partitioning algorithm: anomalies can be isolated with far fewer random splits than normal points, so points with short average path lengths are flagged.

# Take the useful features and standardize them

data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()   # the original named this min_max_scaler, but it is a StandardScaler
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)

# train the isolation forest
model = IsolationForest(contamination=outliers_fraction)
model.fit(data)

# add the predictions to the main dataframe
# IsolationForest returns 1 for normal points and -1 for anomalies; map them to 0/1
df['anomaly25'] = pd.Series(model.predict(data))
df['anomaly25'] = df['anomaly25'].map({1: 0, -1: 1})
print(df['anomaly25'].value_counts())

OneClassSVM

A one-class SVM: it learns a boundary around the "normal" data and flags points that fall outside it.

# Take the useful features and standardize them

data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()   # again a StandardScaler despite the original name
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)

# train the one-class SVM
model = OneClassSVM(nu=0.95 * outliers_fraction) #nu=0.95 * outliers_fraction  + 0.05
model.fit(data)

# add the predictions to the main dataframe (1 = normal, -1 = anomaly, mapped to 0/1)
df['anomaly26'] = pd.Series(model.predict(data))
df['anomaly26'] = df['anomaly26'].map({1: 0, -1: 1})
print(df['anomaly26'].value_counts())

EllipticEnvelope

Steps in the original article:

  1. Split the data into different contexts based on a key categorical column
# creation of 4 different data sets based on the categories defined before

df_class0 = df.loc[df['categories'] == 0, 'value']
df_class1 = df.loc[df['categories'] == 1, 'value']
df_class2 = df.loc[df['categories'] == 2, 'value']
df_class3 = df.loc[df['categories'] == 3, 'value']
  2. Fit each context with a Gaussian distribution (EllipticEnvelope), with contamination = 0.01
# apply EllipticEnvelope (Gaussian distribution) to each category

envelope = EllipticEnvelope(contamination=outliers_fraction)
X_train = df_class0.values.reshape(-1,1)
envelope.fit(X_train)
df_class0 = pd.DataFrame(df_class0)
df_class0['deviation'] = envelope.decision_function(X_train)
df_class0['anomaly'] = envelope.predict(X_train)
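# (the same fit/predict would be repeated for df_class1, df_class2 and df_class3)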
  3. If the prediction is 1, the point fits inside the distribution and is not an anomaly;
     if the prediction is -1, the point lies outside the distribution and is an anomaly.

MarkovChains

Steps in the original article:

  1. Discretize the data into bins, here five levels:
     (very low, low, average, high, very high) / (VL, L, A, H, VH).
# definition of the different states

x1 = (df['value'] <= 18).astype(int)
x2 = ((df['value'] > 18) & (df['value'] <= 21)).astype(int)
x3 = ((df['value'] > 21) & (df['value'] <= 24)).astype(int)
x4 = ((df['value'] > 24) & (df['value'] <= 27)).astype(int)
x5 = (df['value'] > 27).astype(int)
df_mm = x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5
  2. A Markov chain is then fitted on the state sequence: it assigns a probability to a window of consecutive states such as (VL, L, L, A, A, L, A) by multiplying the transition probabilities along the window, and a window whose probability is very low (here below 20%) is flagged as an anomaly (a numpy-only sketch follows the commented code below).
# getting the anomaly labels for our dataset (evaluating sequence of 5 values and anomaly = less than 20% probable)

# the original uses pyemma, which is NOT AVAILABLE in the Kaggle kernel
#df_anomaly = markovAnomaly(df_mm, 5, 0.20)
#df_anomaly = pd.Series(df_anomaly)
#print(df_anomaly.value_counts())
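
Since pyemma is not available here, below is a minimal numpy/pandas sketch of the same idea. The markov_anomaly helper is my own illustration (its name, scoring and threshold semantics are assumptions), not the original markovAnomaly function.

# illustrative sketch: estimate the transition matrix from the discretized states,
# then flag windows whose probability (product of transition probabilities) is below the threshold
def markov_anomaly(states, window=5, threshold=0.20):
    states = np.asarray(states)
    n_states = states.max()                       # states are coded 1..5 here
    trans = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):     # count transitions
        trans[a - 1, b - 1] += 1
    trans = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)  # row-normalize

    labels = np.zeros(len(states), dtype=int)
    for i in range(window, len(states) + 1):
        seq = states[i - window:i]
        prob = np.prod([trans[a - 1, b - 1] for a, b in zip(seq[:-1], seq[1:])])
        labels[i - 1] = int(prob < threshold)     # label the last point of each window
    return pd.Series(labels)

df_anomaly = markov_anomaly(df_mm, 5, 0.20)
print(df_anomaly.value_counts())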

RNN

Here, 50 data points are used to predict the next one,
and the 1% of points with the largest prediction error are then flagged as anomalies.

  • Data preparation
# important parameters and train/test size
# (data_n is the scaled feature DataFrame prepared earlier in the original notebook)
prediction_time = 1
testdatasize = 1000
unroll_length = 50
testdatacut = testdatasize + unroll_length + 1

# train data (.to_numpy() returns an ndarray; the original used the now-removed .as_matrix())
x_train = data_n[0:-prediction_time-testdatacut].to_numpy()
y_train = data_n[prediction_time:-testdatacut][0].to_numpy()

# test data
x_test = data_n[0-testdatacut:-prediction_time].to_numpy()
y_test = data_n[prediction_time-testdatacut:][0].to_numpy()
  • Slicing into sequences
#unroll: create a sequence of the 50 previous data points for each data point
def unroll(data, sequence_length=24):
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])
    return np.asarray(result)

# adapt the datasets to the sequence data shape
# x_train / x_test now have shape (samples, unroll_length, n_features)
x_train = unroll(x_train, unroll_length)
x_test  = unroll(x_test, unroll_length)
y_train = y_train[-x_train.shape[0]:]
y_test  = y_test[-x_test.shape[0]:]
  • Build the model
# specific libraries for the RNN
# Keras is a high-level API built on top of TensorFlow, used here for an easy implementation
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.models import Sequential, model_from_json
import time #helper libraries
import sys

# Build the model
model = Sequential()

# first LSTM layer: 50 units, returning the full sequence for the next LSTM layer
# (the original used the old input_dim/output_dim arguments, removed in Keras 2)
model.add(LSTM(
    50,
    input_shape=(x_train.shape[1], x_train.shape[2]),
    return_sequences=True))
model.add(Dropout(0.2))

# second LSTM layer: 100 units, returning only the last output
model.add(LSTM(
    100,
    return_sequences=False))
model.add(Dropout(0.2))

# single linear output: the predicted next value
model.add(Dense(units=1))
model.add(Activation('linear'))

start = time.time()
model.compile(loss='mse', optimizer='rmsprop')
print('compilation time : {}'.format(time.time() - start))
  • Training
# Train the model

#epochs = 350

model.fit(
    x_train,
    y_train,
    batch_size=3028,
    epochs=30,   # the original used the deprecated nb_epoch argument
    validation_split=0.1)
  • Compute the difference and ratio between the predictions and the actual values, and take the 1% with the largest difference as outliers
# create the list of differences between the predictions and the test data
loaded_model = model  # reuse the model trained above
diff=[]
ratio=[]
p = loaded_model.predict(x_test)
# predictions = lstm.predict_sequences_multiple(loaded_model, x_test, 50, 50)
for u in range(len(y_test)):
    pr = p[u][0]
    ratio.append((y_test[u]/pr)-1)
    diff.append(abs(y_test[u]- pr))

# select the most distant prediction/reality data points as anomalies
diff = pd.Series(diff)
number_of_outliers = int(outliers_fraction*len(diff))
threshold = diff.nlargest(number_of_outliers).min()
# data with anomaly label (test data part)
test = (diff >= threshold).astype(int)
# the training data part where we didn't predict anything (overfitting possible): no anomaly
complement = pd.Series(0, index=np.arange(len(data_n)-testdatasize))
# add the data to the main dataframe
# (Series.append was removed in recent pandas, and ignore_index should be a bool, so use pd.concat)
df['anomaly27'] = pd.concat([complement, test], ignore_index=True)
print(df['anomaly27'].value_counts())