Preface
Original article: here
The focus of this post is Anomaly Detection.
That is, depending on how an anomaly is defined, different detection algorithms/methods are used to find it.
Before looking at the corresponding methods, we first need to know what types of anomalies exist.
Types of anomalies
Refer to this article.
Anomalies fall into three categories:
- global outliers / point anomalies
- contextual outliers / conditional anomalies
- collective outliers
Explanations from this post:
- global outlier:
A data point that lies far away from the entirety of the data.
- contextual outlier:
A contextual anomaly is defined with respect to a shared context; usually the context we mean is a time series, i.e. a periodic context.
It also concerns only a few individual data points.
- collective outlier:
A collective anomaly appears when several data sources are combined and some part of the combined data shows a contextual/global anomaly,
even though each source on its own, as a separate data set, shows no anomaly at all.
For example, traffic jams as time series: looked at individually, it is quite plausible for highway G10, G20 or G30 to be congested at 2 PM; but if on some day all three highways are congested at 2 PM at the same time, that is an anomaly.
Summary
Here is a quick summary of the author's methods for the different kinds of anomalies.
Most of them fit a model to the data, use that model to score the original data, and take the 1% of points with the largest deviation (the points the model fits worst) as anomalies.
- collective anomalies:
  - PCA + clustering (K-means)
  - Isolation Forest
  - One-class SVM
- contextual data + collective anomalies:
  - split the data into different contexts
  - use an elliptic envelope (Gaussian distribution) to fit each context
- sequential/contextual anomalies:
  - Method 1
    - discretize the target data into different states
    - use Markov chains
  - Method 2
    - RNN
Summary of methods
Here are the imported packages:
import numpy as np   # used by the code below
import pandas as pd  # used by the code below
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
#from pyemma import msm # not available on Kaggle Kernel
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
PCA+Cluster
The steps in the original article are:
- pick a few columns
- use PCA to reduce them to 2 components, plus a MinMaxScaler
- cluster the result with K-means
- manually define 1% of the data as outliers:
outliers_fraction = 0.01
- take the 1% of points with the largest deviation (worst fit) as anomalies
- add an anomaly column to the original data based on the points selected in the previous step
(a minimal sketch of these steps follows below)
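The following is a minimal sketch of those steps, not the original kernel's code: the feature columns, n_clusters=15, random_state=42 and the output column name anomaly_pca are illustrative assumptions, and df / outliers_fraction are assumed to be the same objects used in the blocks further down.

# sketch: PCA to 2 components + MinMaxScaler, K-means, flag the worst-fitting 1%
cols = ['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']  # illustrative columns
X = PCA(n_components=2).fit_transform(df[cols])            # reduce to 2 components
X = preprocessing.MinMaxScaler().fit_transform(X)          # rescale to [0, 1]
kmeans = KMeans(n_clusters=15, random_state=42).fit(X)     # cluster the reduced data
# distance of each point to its own cluster centre = how badly it fits
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
outliers_fraction = 0.01
n_outliers = int(outliers_fraction * len(dist))
threshold = pd.Series(dist).nlargest(n_outliers).min()     # cut-off for the worst 1%
df['anomaly_pca'] = (dist >= threshold).astype(int)        # hypothetical column: 1 = anomaly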
Isolation Forest
Isolation Forest is a recursive partitioning algorithm: anomalies can be isolated with fewer random splits than normal points.
# Take useful features and standardize them
data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()   # StandardScaler, despite the original variable name min_max_scaler
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
# train the isolation forest
model = IsolationForest(contamination=outliers_fraction)
model.fit(data)
# add the labels to the main dataframe (predict: 1 = inlier, -1 = outlier, mapped to 0/1)
df['anomaly25'] = pd.Series(model.predict(data))
df['anomaly25'] = df['anomaly25'].map({1: 0, -1: 1})
print(df['anomaly25'].value_counts())
OneClassSVM
An SVM trained on a single class (one-class SVM), used here for outlier detection.
# Take useful features and standardize them
data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()   # StandardScaler, despite the original variable name min_max_scaler
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
# train the one-class SVM
model = OneClassSVM(nu=0.95 * outliers_fraction)  # nu=0.95 * outliers_fraction + 0.05
model.fit(data)
# add the labels to the main dataframe (1 = inlier, -1 = outlier, mapped to 0/1)
df['anomaly26'] = pd.Series(model.predict(data))
df['anomaly26'] = df['anomaly26'].map({1: 0, -1: 1})
print(df['anomaly26'].value_counts())
EllipticEnvelope
The steps in the original article:
- split the data into different contexts based on a key categorical column
# creation of 4 different data sets based on the categories defined before
df_class0 = df.loc[df['categories'] == 0, 'value']
df_class1 = df.loc[df['categories'] == 1, 'value']
df_class2 = df.loc[df['categories'] == 2, 'value']
df_class3 = df.loc[df['categories'] == 3, 'value']
- fit each context with a Gaussian distribution, with contamination (expected proportion of outliers) = 0.01
# apply EllipticEnvelope (gaussian distribution) to each category
envelope = EllipticEnvelope(contamination = outliers_fraction)
X_train = df_class0.values.reshape(-1,1)
envelope.fit(X_train)
df_class0 = pd.DataFrame(df_class0)
df_class0['deviation'] = envelope.decision_function(X_train)
df_class0['anomaly'] = envelope.predict(X_train)
- if the prediction is 1, the point fits inside the distribution and is not an anomaly;
if the prediction is -1, the point lies outside the distribution and is an anomaly
(the code above only handles df_class0; a sketch covering all four categories follows below).
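Below is a minimal sketch, not from the original kernel, that repeats the same fit/predict for each of the four categories and writes one combined 0/1 anomaly column back to df; the column name anomaly_ee is a hypothetical placeholder.

# sketch: apply the same EllipticEnvelope to every category and collect the labels
df['anomaly_ee'] = 0                                      # hypothetical column name
for cat in range(4):
    mask = df['categories'] == cat
    X_cat = df.loc[mask, 'value'].values.reshape(-1, 1)
    envelope = EllipticEnvelope(contamination=outliers_fraction)
    envelope.fit(X_cat)
    # predict() returns 1 for inliers and -1 for outliers; map to 0/1 anomaly flags
    df.loc[mask, 'anomaly_ee'] = (envelope.predict(X_cat) == -1).astype(int)
print(df['anomaly_ee'].value_counts())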
MarkovChains
The steps in the original article:
- discretize the data into different levels, for example 5 levels here:
(very low, low, average, high, very high) / (VL, L, A, H, VH).
# definition of the different state
x1 = (df['value'] <=18).astype(int)
x2= ((df['value'] > 18) & (df['value']<=21)).astype(int)
x3 = ((df['value'] > 21) & (df['value']<=24)).astype(int)
x4 = ((df['value'] > 24) & (df['value']<=27)).astype(int)
x5 = (df['value'] >27).astype(int)
df_mm = x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5
- a Markov chain is then fit to the state sequence (its transition probabilities are estimated from the data); for a window of states such as (VL, L, L, A, A, L, A) it computes the probability of that sequence as the product of the transition probabilities, and if the probability is very low the window is flagged as an anomaly (a re-implementation sketch follows the commented-out code below).
# getting the anomaly labels for our dataset (evaluating sequence of 5 values and anomaly = less than 20% probable)
# I USE pyemma NOT AVAILABLE IN KAGGLE KERNEL
#df_anomaly = markovAnomaly(df_mm, 5, 0.20)
#df_anomaly = pd.Series(df_anomaly)
#print(df_anomaly.value_counts())
RNN
Here the previous 50 data points are used to predict the next one;
the 1% of points with the largest difference between prediction and reality are then labeled as anomalies.
- data preparation
# important parameters and train/test size
prediction_time = 1
testdatasize = 1000
unroll_length = 50
testdatacut = testdatasize + unroll_length + 1
# train data (.values replaces the deprecated .as_matrix() to get an ndarray)
x_train = data_n[0:-prediction_time-testdatacut].values
y_train = data_n[prediction_time:-testdatacut][0].values
# test data
x_test = data_n[0-testdatacut:-prediction_time].values
y_test = data_n[prediction_time-testdatacut:][0].values
- slicing into sequences
# unroll: create a sequence of the 50 previous data points for each data point
def unroll(data, sequence_length=24):
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])
    return np.asarray(result)

# adapt the datasets to the sequence data shape
x_train = unroll(x_train, unroll_length)
x_test = unroll(x_test, unroll_length)
y_train = y_train[-x_train.shape[0]:]
y_test = y_test[-x_test.shape[0]:]
- building the model
# specific libraries for the RNN
# keras is a high-level layer built on top of TensorFlow to keep the implementation simple
from keras.layers import Dense, Activation, Dropout, LSTM  # keras.layers.core / keras.layers.recurrent are deprecated paths
from keras.models import Sequential
import time  # helper library
from keras.models import model_from_json
import sys
# Build the model
model = Sequential()
model.add(LSTM(
    50,                                              # output_dim=50 in the old Keras 1 API
    input_shape=(x_train.shape[1], x_train.shape[-1]),
    return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(
    100,
    return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))
model.add(Activation('linear'))

start = time.time()
model.compile(loss='mse', optimizer='rmsprop')
print('compilation time : {}'.format(time.time() - start))
- training
# Train the model
# nb_epoch = 350
model.fit(
    x_train,
    y_train,
    batch_size=3028,
    epochs=30,               # nb_epoch in the old Keras 1 API
    validation_split=0.1)
- compute the difference and ratio between prediction and reality, and take the 1% of points with the largest difference as outliers
# create the list of differences between predictions and test data
loaded_model = model
diff = []
ratio = []
p = loaded_model.predict(x_test)
# predictions = lstm.predict_sequences_multiple(loaded_model, x_test, 50, 50)
for u in range(len(y_test)):
    pr = p[u][0]
    ratio.append((y_test[u] / pr) - 1)
    diff.append(abs(y_test[u] - pr))
# select the most distant prediction/reality data points as anomalies
diff = pd.Series(diff)
number_of_outliers = int(outliers_fraction * len(diff))
threshold = diff.nlargest(number_of_outliers).min()
# data with anomaly label (test data part)
test = (diff >= threshold).astype(int)
# the training data part, where nothing was predicted (overfitting possible): no anomaly
complement = pd.Series(0, index=np.arange(len(data_n) - testdatasize))
# add the labels to the main dataframe (pd.concat replaces the removed Series.append; ignore_index must be a bool, not the string 'True')
df['anomaly27'] = pd.concat([complement, test], ignore_index=True)
print(df['anomaly27'].value_counts())
Contact: clarence_wu12@outlook.com