Preface

Link to the original article: here
The focus of this post is Anomaly Detection,
that is, detecting anomalies with different algorithms/methods depending on how the anomaly is defined.
Before looking at the corresponding methods, we first need to know what types of anomalies there are.

Types of Anomalies

See this article.
Anomalies fall into three categories:

  • global outliers / point anomalies
  • contextual outliers / conditional anomalies
  • collective outliers

The original article's explanations:

  • global outlier:
    a data point that suddenly sits far away from the entirety of the data
  • contextual outlier:
    the premise is that the points share the same context; usually the context we mean is a time series, i.e. a periodic context.
    Again this concerns individual data points.
  • collective outlier:
    a collective anomaly appears when several data sources are combined and part of the combined data shows a contextual/global anomaly,
    even though, taken as separate sources, none of the corresponding data is anomalous.
    For example, traffic-jam time series: looked at separately, the G10, G20 and G30 highways each being jammed at 2 p.m. is quite plausible, but if on some day all three are jammed at 2 p.m. at the same time, that is an anomaly.

Summary

A quick summary of the methods the author uses for the different kinds of anomalies.
Most of them fit a model to the data, use the model to score the original data, and take the 1% with the largest deviation (the points the model does not fit well) as anomalies; this shared thresholding pattern is sketched after the list below.

    1. collective anomalies
      • PCA + clustering (K-means)
      • Isolation Forest
      • One-class SVM
    2. contextual data + collective anomalies
      • split the data into different contexts
      • use an elliptic envelope (Gaussian distribution) to fit the data
    3. sequential/contextual anomalies
      • Method 1
        • discretize the target data into different states
        • use Markov chains
      • Method 2
        • RNN
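
As a minimal sketch of that shared thresholding pattern (the scores series below is only a stand-in for whatever per-point deviation score each method produces; it is not code from the original notebook):

# generic "top 1% of deviation scores are anomalies" pattern (illustrative sketch)
import numpy as np
import pandas as pd

outliers_fraction = 0.01
scores = pd.Series(np.random.rand(1000))                 # stand-in deviation scores
number_of_outliers = int(outliers_fraction * len(scores))
threshold = scores.nlargest(number_of_outliers).min()    # smallest score among the top 1%
anomaly = (scores >= threshold).astype(int)              # 1 = anomaly, 0 = normal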

Methods

First, the imported packages:

import numpy as np   # used by the later code blocks
import pandas as pd  # used by the later code blocks

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
#from pyemma import msm # not available on Kaggle Kernel
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

PCA+Cluster

The steps in the original article (a code sketch follows the list):

  1. Pick a few columns
  2. Reduce the dimensionality to 2 components with PCA + MinMaxScaler
  3. Cluster with K-means
  4. Manually define 1% of the data as outliers: outliers_fraction = 0.01
  5. Take the 1% of points with the largest deviation (e.g. distance to their cluster centre) as anomalies
  6. Add an anomaly column to the original data based on the points selected in step 5
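
The original section only lists the steps, so here is a minimal sketch of them. It reuses the imports and the df DataFrame from the notebook; the column selection follows the later code blocks, while the scaling-before-PCA order, the number of clusters (4) and the distance-to-centre anomaly score are my own assumptions, not code taken verbatim from the original notebook.

# minimal sketch of the PCA + K-means steps above (n_clusters and the
# distance-to-centre score are illustrative assumptions)
data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
np_scaled = preprocessing.MinMaxScaler().fit_transform(data)
X = PCA(n_components=2).fit_transform(np_scaled)

kmeans = KMeans(n_clusters=4, random_state=42).fit(X)
# deviation score = distance of each point to its assigned cluster centre
distance = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

outliers_fraction = 0.01
number_of_outliers = int(outliers_fraction * len(distance))
threshold = pd.Series(distance).nlargest(number_of_outliers).min()
df['anomaly_pca_kmeans'] = (distance >= threshold).astype(int)  # 1 = anomaly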

Isolation Forest

Isolation Forest is a recursive random-partitioning algorithm: anomalies can be isolated with far fewer random splits than normal points, so points with short average path lengths are flagged.

# Take the useful features and standardize them

data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()   # the original named this min_max_scaler, but it is a StandardScaler
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)

# train the isolation forest
model = IsolationForest(contamination=outliers_fraction)
model.fit(data)

# add the predictions to the main dataframe
# IsolationForest returns 1 for normal points and -1 for anomalies; map them to 0/1
df['anomaly25'] = pd.Series(model.predict(data))
df['anomaly25'] = df['anomaly25'].map({1: 0, -1: 1})
print(df['anomaly25'].value_counts())

OneClassSVM

A one-class SVM: it learns a boundary around the "normal" data and flags points that fall outside it.

# Take the useful features and standardize them

data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()   # again a StandardScaler despite the original name
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)

# train the one-class SVM
model = OneClassSVM(nu=0.95 * outliers_fraction) #nu=0.95 * outliers_fraction  + 0.05
model.fit(data)

# add the predictions to the main dataframe (1 = normal, -1 = anomaly, mapped to 0/1)
df['anomaly26'] = pd.Series(model.predict(data))
df['anomaly26'] = df['anomaly26'].map({1: 0, -1: 1})
print(df['anomaly26'].value_counts())

EllipticEnvelope

Steps in the original article:

  1. Split the data into different contexts based on a key categorical column
# creation of 4 different data sets based on the categories defined before

df_class0 = df.loc[df['categories'] == 0, 'value']
df_class1 = df.loc[df['categories'] == 1, 'value']
df_class2 = df.loc[df['categories'] == 2, 'value']
df_class3 = df.loc[df['categories'] == 3, 'value']
  2. Fit each context with a Gaussian distribution (EllipticEnvelope), with contamination = 0.01
# apply EllipticEnvelope (Gaussian distribution) to each category

envelope = EllipticEnvelope(contamination=outliers_fraction)
X_train = df_class0.values.reshape(-1,1)
envelope.fit(X_train)
df_class0 = pd.DataFrame(df_class0)
df_class0['deviation'] = envelope.decision_function(X_train)
df_class0['anomaly'] = envelope.predict(X_train)
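# (the same fit/predict would be repeated for df_class1, df_class2 and df_class3)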
  3. If the prediction is 1, the point fits inside the distribution and is not an anomaly;
     if the prediction is -1, the point lies outside the distribution and is an anomaly.

MarkovChains

Steps in the original article:

  1. Discretize the data into bins, here five levels:
     (very low, low, average, high, very high) / (VL, L, A, H, VH).
# definition of the different states

x1 = (df['value'] <= 18).astype(int)
x2 = ((df['value'] > 18) & (df['value'] <= 21)).astype(int)
x3 = ((df['value'] > 21) & (df['value'] <= 24)).astype(int)
x4 = ((df['value'] > 24) & (df['value'] <= 27)).astype(int)
x5 = (df['value'] > 27).astype(int)
df_mm = x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5
  2. A Markov chain is then fitted on the state sequence: it assigns a probability to a window of consecutive states such as (VL, L, L, A, A, L, A) by multiplying the transition probabilities along the window, and a window whose probability is very low (here below 20%) is flagged as an anomaly (a numpy-only sketch follows the commented code below).
# getting the anomaly labels for our dataset (evaluating sequence of 5 values and anomaly = less than 20% probable)

# the original uses pyemma, which is NOT AVAILABLE in the Kaggle kernel
#df_anomaly = markovAnomaly(df_mm, 5, 0.20)
#df_anomaly = pd.Series(df_anomaly)
#print(df_anomaly.value_counts())
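
Since pyemma is not available here, below is a minimal numpy/pandas sketch of the same idea. The markov_anomaly helper is my own illustration (its name, scoring and threshold semantics are assumptions), not the original markovAnomaly function.

# illustrative sketch: estimate the transition matrix from the discretized states,
# then flag windows whose probability (product of transition probabilities) is below the threshold
def markov_anomaly(states, window=5, threshold=0.20):
    states = np.asarray(states)
    n_states = states.max()                       # states are coded 1..5 here
    trans = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):     # count transitions
        trans[a - 1, b - 1] += 1
    trans = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)  # row-normalize

    labels = np.zeros(len(states), dtype=int)
    for i in range(window, len(states) + 1):
        seq = states[i - window:i]
        prob = np.prod([trans[a - 1, b - 1] for a, b in zip(seq[:-1], seq[1:])])
        labels[i - 1] = int(prob < threshold)     # label the last point of each window
    return pd.Series(labels)

df_anomaly = markov_anomaly(df_mm, 5, 0.20)
print(df_anomaly.value_counts())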

RNN

Here, 50 data points are used to predict the next one,
and the 1% of points with the largest prediction error are then flagged as anomalies.

  • Data preparation
# important parameters and train/test size
# (data_n is the scaled feature DataFrame prepared earlier in the original notebook)
prediction_time = 1
testdatasize = 1000
unroll_length = 50
testdatacut = testdatasize + unroll_length + 1

# train data (.to_numpy() returns an ndarray; the original used the now-removed .as_matrix())
x_train = data_n[0:-prediction_time-testdatacut].to_numpy()
y_train = data_n[prediction_time:-testdatacut][0].to_numpy()

# test data
x_test = data_n[0-testdatacut:-prediction_time].to_numpy()
y_test = data_n[prediction_time-testdatacut:][0].to_numpy()
  • Slicing into sequences
#unroll: create a sequence of the 50 previous data points for each data point
def unroll(data, sequence_length=24):
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])
    return np.asarray(result)

# adapt the datasets to the sequence data shape
# x_train / x_test now have shape (samples, unroll_length, n_features)
x_train = unroll(x_train, unroll_length)
x_test  = unroll(x_test, unroll_length)
y_train = y_train[-x_train.shape[0]:]
y_test  = y_test[-x_test.shape[0]:]
  • Build the model
# specific libraries for the RNN
# Keras is a high-level API built on top of TensorFlow, used here for an easy implementation
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.models import Sequential, model_from_json
import time #helper libraries
import sys

# Build the model
model = Sequential()

# first LSTM layer: 50 units, returning the full sequence for the next LSTM layer
# (the original used the old input_dim/output_dim arguments, removed in Keras 2)
model.add(LSTM(
    50,
    input_shape=(x_train.shape[1], x_train.shape[2]),
    return_sequences=True))
model.add(Dropout(0.2))

# second LSTM layer: 100 units, returning only the last output
model.add(LSTM(
    100,
    return_sequences=False))
model.add(Dropout(0.2))

# single linear output: the predicted next value
model.add(Dense(units=1))
model.add(Activation('linear'))

start = time.time()
model.compile(loss='mse', optimizer='rmsprop')
print('compilation time : {}'.format(time.time() - start))
  • Training
# Train the model

#epochs = 350

model.fit(
    x_train,
    y_train,
    batch_size=3028,
    epochs=30,   # the original used the deprecated nb_epoch argument
    validation_split=0.1)
  • Compute the difference and ratio between the predictions and the actual values, and take the 1% with the largest difference as outliers
# create the list of differences between the predictions and the test data
loaded_model = model  # reuse the model trained above
diff=[]
ratio=[]
p = loaded_model.predict(x_test)
# predictions = lstm.predict_sequences_multiple(loaded_model, x_test, 50, 50)
for u in range(len(y_test)):
    pr = p[u][0]
    ratio.append((y_test[u]/pr)-1)
    diff.append(abs(y_test[u]- pr))

# select the most distant prediction/reality data points as anomalies
diff = pd.Series(diff)
number_of_outliers = int(outliers_fraction*len(diff))
threshold = diff.nlargest(number_of_outliers).min()
# data with anomaly label (test data part)
test = (diff >= threshold).astype(int)
# the training data part where we didn't predict anything (overfitting possible): no anomaly
complement = pd.Series(0, index=np.arange(len(data_n)-testdatasize))
# add the data to the main dataframe
# (Series.append was removed in recent pandas, and ignore_index should be a bool, so use pd.concat)
df['anomaly27'] = pd.concat([complement, test], ignore_index=True)
print(df['anomaly27'].value_counts())