应用高斯分布来解决异常检测问题(三)
(原创文章,转载请注明出处!)
本篇来解决一个异常检测问题。样本数据约300个,无标签的二维数据。此外,还有一个有标签的验证数据集,包含约300个样本。
一、将每个维度数据的直方图plot出来
1 hist(X[,1], breaks=seq(from=floor(min(X[,1])),to=ceiling(max(X[,1])),by=0.3), freq=TRUE, col="blue1", main="x1") # histogram of 1st feature 2 hist(X[,2], breaks=seq(from=floor(min(X[,2])),to=ceiling(max(X[,2])),by=0.3), freq=TRUE, col="blue1", main="x2") # histogram of 2nd feature
可以看出两个特征的样本数据的直方图与一元高斯分布的钟型图比较相似。所以,不需要对样本数据进行额外的处理,就能用了训练高斯模型。
二、阈值
在文章应用高斯分布来解决异常检测问题(二)中给出如何进行训练以及计算样本概率的函数,但要使用所获得的模型来进行预测,还需要用于判别的阈值。但计算得到的概率值小于阈值时,就认为样本异常。阈值的获取是通过遍历所有可能的概率值,然后计算每个值对应的F1值,F1值最大的概率值,被选为阈值。具体代码如下:(阈值使用验证数据集确实)
1 ## Find the abnormal threshold. Different step values should be tried to make the algorithm finding the best threshold. 2 ## To get the threshold, all the possible value will be checked and the one got the best F1 score will be chosen. 3 ## parameters: 4 ## valProb - a vector, which stores probability of validation samples. 5 ## yval - a vector, which stores label of validation samples. 6 ## step - integer, step size of the probability value traversal 7 ## return value: 8 ## a vector, the length of it is 2. 1st element in the vector is threshold. 9 ## 2nd element in the vector is the best F1 score. 10 craEx8AD_funFindThreshold<- function(valProb, yval, step=0.001) 11 { 12 # todo: check the parameters at first. 13 if (step < 0) { 14 step <- (max(valProb) - min(valProb)) / 1000 15 } 16 thresholdCandidate <- seq(from=min(valProb), to=max(valProb), by=step) 17 bestF1 <- 0 18 bestThreshold <- 0 19 for (val in thresholdCandidate) { 20 ## calculate the F1 score 21 # precision 22 p <- length(which(valProb[which(yval == 1)] < val)) / length( which(valProb < val) ) 23 # recall, 24 r <- length(which(valProb[which(yval == 1)] < val)) / length( which(yval == 1) ) 25 if ( is.na(p) || is.na(r) ) { 26 next 27 } 28 # F1 29 F1 <- (2 * p * r) / (p + r) 30 31 if (F1 > bestF1) { 32 bestThreshold <- val 33 bestF1 <- F1 34 } 35 } 36 return( c(bestThreshold, bestF1) ) 37 }
在寻找阈值时,要多次尝试不同的搜寻步长,以找到最合适的阈值。
三、用多个一元高斯分布模型训练与预测
将训练数据和验证数据加载后,使用文章应用高斯分布来解决异常检测问题(二)中给出的多个一元高斯分布建模的函数进训练,具体代码如下:
1 X <- read.table(file="X.txt") 2 Xval <- read.table(file="Xval.txt")
3 yval <- read.table(file="yval.txt")
4 X <- as.matrix(craEx8AD_X)
5 Xval <- as.matrix(craEx8AD_Xval)
6 yval <- as.matrix(yval) 7 probXvalByM_U_Norm <- funMultiUnivariateNorm (x = Xval, X = X, parameterFile = ".MultiUnivariateNorm", isTraining = TRUE)
8 thresholdOfM_U_Norm <- funFindThreshold( valProb = probXvalByM_U_Norm, yval = as.vector(yval) )
由于样本数据是二维的,所以可以方便的将数据plot出来,更方便的看到训练预测的效果。plot图形需要用到如下函数:
1 ## plot the outline of norm model at the threshold 2 ## To get the threshold, all the possible value will be checked and the one got the best F1 score will be chosen. 3 ## parameters: 4 ## lowBound - numeric, low bound of probability value 5 ## upBound - numeric, up bound of probability value
6 ## threshold - numeric, value of the threshold 7## whichModel - integer, 1 is Multi-Univariate Norm model 8 ## 2 is Multivariate Norm model 9 plotOutline <- function(lowBound, upBound, whichModel) 10 { 11 u <- seq(lowBound, upBound, by=.05) 12 v <- seq(lowBound, upBound, by=.05) 13 z <- matrix(0, nrow=length(u), ncol=length(v)) 14 15 for (i in 1:length(u)) { 16 for (j in 1:length(v)) { 17 if (whichModel == 1) { 18 z[i,j] <- funMultiUnivariateNorm (x=c(u[i],v[j]), X = NULL, parameterFile = ".MultiUnivariateNorm", isTraining = FALSE) 19 } else if (whichModel == 2) { 20 z[i,j] <- funMultivariateNorm (x=c(u[i],v[j]), X = NULL, parameterFile = ".MultivariateNorm", isTraining = FALSE) 21 } 22 } 23 } 24 25 contour(u, v, z, col = "green", add = TRUE, method = "simple", levels=c(threshold), lty="solid") 26 }
将验证集数据的图形plot出来:
1 lowBound <- min(floor(min(X[,1])), floor(min(X[,2])) ) 2 upBound <- max(ceiling(max(X[,1])), ceiling(max(X[,2])) ) 3 plot( x = c(lowBound, upBound), y = c(lowBound, upBound), type = "n") 4 points(X[,1], X[,2], pch=46) 5 ### red circle 6 probXByM_U_Norm <- funMultiUnivariateNorm (x = X, X = NULL, parameterFile = ".MultiUnivariateNorm", isTraining = FALSE) 7 redPoins <- X[which(as.vector(yval) == 1),] 8 points(redPoins[,1], redPoins[,2], col = "red", pch=1) 9 ### outline 10 plotOutline(lowBound, upBound, thresholdOfM_U_Norm[1], whichModel=1)
plot出来的图形:
图上有两条轮廓线,分别对应阈值0.0000896 (9.961567e-05)和0.001,它们对应的F1值分别是0.8750和0.8235294,这两个阈值是用两个不同的步长,分别计算出来的。通过F1值判断出0.0000896要更合适。
将训练数据plot出来:
图中,异常样本点标上了红圈,阈值的轮廓线是绿色的圈,圈上的0.001就是选择的阈值。
四、给样本数据加一维数据
多个一元高斯分布模型每个一元高斯分布都是独立的,不能扑捉到特征之间的关系。所以,尝试给样本数据创建一维新数据,x1/x2。
1 X <- cbind(X, X[,1] / X[,2]) 2 hist(X[,3], breaks=seq(from=floor(min(X[,3])),to=ceiling(max(X[,3])),by=0.03), freq=TRUE, col="blue1", main="x3") # histogram of 3rd feature
直方图如下:
可以看出,x3的直方图与一元高斯分布的钟型图比较相近,可以直接使用到模型中进行训练。训练与预测的代码与上述的一样,在此不再赘述。通过计算,最后预测的结果也上述的结果一样。
也许是由于原始样本数据的两个特征客观上具有独立性,或者新增的特性不能很好的描述特征间的关系,以至于新加的特征对训练与预测的结果的影响不明显。就本例来看,前者的可能性较大。
五、用一个多元高斯分布模型训练与预测
用一个多元高斯分布模型来建模与上述的过程基本一致,而且由于数据样本简单,在此不再赘述。