Lecture 15 Anomaly detection

Problem motivation

Gaussian distribution

Algorithm

Density estimation

Training set : \(\{x^{(1)},x^{(2)},\dots,x^{(m)}\}\)

Each example is \(x \in \mathbb{R}^n\)

\[x_1 \sim N(\mu_1,\sigma_1^2)\\ x_2 \sim N(\mu_2,\sigma_2^2)\\ \vdots\\ x_n \sim N(\mu_n,\sigma_n^2)\\ p(x) = p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma_2^2)\dots p(x_n;\mu_n,\sigma_n^2) =\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) \]

Anomaly detection algorithm

  1. Choose features \(x_i\) that you think might be indicative of anomalous examples

  2. Fit parameters \(\mu_1,\mu_2,\dots,\mu_n,\sigma_1^2,\sigma_2^2,\dots,\sigma_n^2\)

    \(\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}\)

    \(\sigma_j^2=\frac{1}{m}\sum_{i=1}^m(x_j^{(i)}-\mu_j)^2\)

  3. Given a new example \(x\), compute \(p(x)\):

    \[p(x)=\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma^2_j}\right) \]

    Anomaly if \(p(x)<\varepsilon\)
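
A minimal NumPy sketch of steps 2 and 3; the function and variable names (`fit_gaussian_params`, `density`, `X_train`, `epsilon`) are illustrative, not from the lecture:

```python
import numpy as np

def fit_gaussian_params(X):
    """Step 2: fit per-feature parameters mu_j and sigma_j^2 (using 1/m, as in the lecture)."""
    mu = X.mean(axis=0)                    # shape (n,)
    sigma2 = ((X - mu) ** 2).mean(axis=0)  # shape (n,)
    return mu, sigma2

def density(X, mu, sigma2):
    """Step 3: p(x) = prod_j p(x_j; mu_j, sigma_j^2) for each row of X."""
    per_feature = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return per_feature.prod(axis=1)

# Usage (X_train holds the normal training examples, one row per example):
# mu, sigma2 = fit_gaussian_params(X_train)
# is_anomaly = density(X_new, mu, sigma2) < epsilon
```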

Anomaly detection example

(Figure: \(x_1\sim N(5,4)\) and \(x_2\sim N(3,1)\), with test examples \(x_{test}^{(1)}\) and \(x_{test}^{(2)}\) in the \(x_1\)–\(x_2\) plane.)

\(x_1\sim N(5,4)\), \(x_2\sim N(3,1)\)

For a test example \(x_{test}^{(1)}\) with coordinates \((a,b)\), compute the density of its first coordinate, \(p(a;\mu_1,\sigma_1^2)\), and of its second coordinate, \(p(b;\mu_2,\sigma_2^2)\). The product \(p(x) = p(a;\mu_1,\sigma_1^2)\,p(b;\mu_2,\sigma_2^2)\) is then the density of \(x_{test}^{(1)}\) at the point \((a,b)\) in the plane. Since \(p(x_{test}^{(1)})=0.0426 \geq \varepsilon\), it is judged to be a normal example.

Similarly, \(p(x_{test}^{(2)})=0.0021 < \varepsilon\), so it is judged to be anomalous.

Developing and evaluating an anomaly detection system

The importance of real-number evaluation

When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.

Assume we have some labeled data of anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous).

Training set: \(x^{(1)},x^{(2)},\dots,x^{(m)}\) (assume normal examples / not anomalous). Treat the training set as a collection of normal examples, even if a few anomalous examples slip in.

Cross validation set: \(\{(x_{cv}^{(1)},y_{cv}^{(1)}),\dots,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})\}\)

Test set : \(\{(x_{test}^{(1)},y_{test}^{(1)}),\dots,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})\}\)

Both the cross validation set and the test set contain some anomalous examples.

Aircraft engines motivating example

10000 good (normal) engines

20 flawed engines (anomalous)

Training set : 6000 good engines (y = 0)

CV : 2000 good engines (y = 0), 10 anomalous (y = 1)

Test : 2000 good engines (y = 0), 10 anomalous (y = 1)

Alternative (not recommended):

Training set : 6000 good engines (y = 0)

CV : 4000 good engines (y = 0), 10 anomalous (y = 1)

Test : 4000 good engines (y = 0), 10 anomalous (y = 1)

The cross validation set and the test set should be two completely different data sets; with only 10000 good engines, the alternative has to reuse the same examples for both.
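
A sketch of the recommended 60/20/20 split for the engine example; the arrays below are synthetic stand-ins for real engine measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data standing in for real engine measurements (4 features each).
X_good = rng.normal(size=(10000, 4))    # normal engines (y = 0)
X_anom = rng.normal(size=(20, 4)) + 5   # flawed engines (y = 1)

X_good = rng.permutation(X_good)        # shuffle before splitting

# Training set: 6000 good engines only (no labels needed).
X_train = X_good[:6000]

# CV and test: 2000 good + 10 anomalous each, with labels, and no overlap.
X_cv   = np.vstack([X_good[6000:8000], X_anom[:10]])
y_cv   = np.concatenate([np.zeros(2000), np.ones(10)])

X_test = np.vstack([X_good[8000:10000], X_anom[10:]])
y_test = np.concatenate([np.zeros(2000), np.ones(10)])
```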

Algorithm evaluation

Fit model \(p(x)\) on training set \(\{x^{(1)},\dots,x^{(m)}\}\)

On a cross validation/test example \(x\), predict

\[y = \begin{cases} 1 & \text{if } p(x)<\varepsilon\ \text{(anomaly)}\\ 0 & \text{if } p(x)\geq\varepsilon\ \text{(normal)} \end{cases} \]

Possible evaluation metrics:

  • True positive, false positive, false negative, true negative
  • Precision/Recall
  • \(F_1\)-score

Can also use the cross validation set to choose the parameter \(\varepsilon\):

Try many different values of \(\varepsilon\) and pick the one that maximizes the \(F_1\)-score.
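
A sketch of choosing \(\varepsilon\) on the cross validation set by maximizing the \(F_1\)-score; `p_cv` is assumed to hold the densities \(p(x)\) already computed for the CV examples, and the names are illustrative:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from true/false positives and negatives (y = 1 means anomaly)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv):
    """Try many thresholds and keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        y_pred = (p_cv < eps).astype(int)   # anomaly if p(x) < epsilon
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```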

Anomaly detection vs. supervised learning

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples (y = 1) (0–20 is common). Large number of negative (y = 0) examples. | Large number of positive and negative examples. |
| Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set. |

Anomaly detection suits settings with very few positive examples and many negative examples; supervised learning suits settings with plenty of both.

There are many kinds of anomalies and only a few anomalous examples, so it is hard to learn from the positive examples what anomalies look like, and future anomalies may look completely different from the ones seen so far. This setting calls for anomaly detection.

Supervised learning has enough positive examples to learn from, and future anomalies are expected to resemble the positive examples already in the training set.

The key factor is the number of positive examples.

| Anomaly detection | Supervised learning |
| --- | --- |
| Fraud detection | Email spam classification |
| Manufacturing (e.g. aircraft engines) | Weather prediction |
| Monitoring machines in a data center | Cancer classification |

Choosing what features to use

Non-Gaussian features

(Figure: histogram of a non-Gaussian feature before and after a transformation.)

For data that does not look Gaussian, apply a transformation so that it looks closer to Gaussian. The algorithm still runs without the transformation, but it usually works better with it.

Common transformations include:

\[\begin{align} & x_1 \leftarrow \log(x_1)\\ & x_2 \leftarrow \log(x_2 + c)\\ & x_3 \leftarrow \sqrt{x_3}\\ & x_4 \leftarrow x_4^{\frac{1}{3}} \end{align} \]
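
A sketch of applying these transforms in NumPy; the feature arrays here are synthetic, skewed stand-ins, and the constant \(c\) is chosen by eye (e.g. by inspecting a histogram):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed, non-negative features (exponentially distributed as a stand-in).
x1, x2, x3, x4 = rng.exponential(scale=2.0, size=(4, 1000))

x1_t = np.log(x1)        # assumes x1 > 0
x2_t = np.log(x2 + 1)    # the constant c (here 1) is tuned until the histogram looks roughly Gaussian
x3_t = np.sqrt(x3)
x4_t = x4 ** (1 / 3)

# Inspect the result with a histogram, e.g. plt.hist(x2_t, bins=50),
# and adjust c or the transform by eye.
```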

Error analysis for anomaly detection

Want \(p(x)\) large for normal examples \(x\)

\(p(x)\) small for anomalous examples \(x\)

Most common problem: \(p(x)\) is comparable (say, both large) for normal and anomalous examples.

In that case, look for a new feature \(x_2\) that separates the anomalous examples from the normal ones.

(Figure: adding a new feature \(x_2\) separates the anomalous example from the normal examples.)

Monitoring computers in a data center

Choose features that might take on unusually large or small values in the event of an anomaly

\(x_1\)=memory use of computer

\(x_2\)=number of disk accesses/sec

\(x_3\)=CPU load

\(x_4\)=network traffic

\(x_5\)=CPU load / network traffic

\(x_6\)=(CPU load)\(^2\) / network traffic
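
A sketch of building these derived features from raw metrics; all values below are synthetic stand-ins for real monitoring data:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
# Hypothetical raw monitoring metrics for m machines.
memory          = rng.normal(4.0, 0.5, m)    # x1: memory use
disk_accesses   = rng.normal(50, 10, m)      # x2: disk accesses / sec
cpu_load        = rng.normal(0.6, 0.1, m)    # x3: CPU load
network_traffic = rng.normal(100, 20, m)     # x4: network traffic

x5 = cpu_load / network_traffic              # catches machines spinning on CPU with little traffic
x6 = cpu_load ** 2 / network_traffic

X = np.column_stack([memory, disk_accesses, cpu_load, network_traffic, x5, x6])
```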

Multivariate Gaussian distribution

Motivating example: Monitoring machines in a data center

(Figure: CPU load vs. memory use, with the anomalous example \(x^{(1)}\) and the contours of the per-feature model.)

Consider the green example \(x^{(1)}=(0.5,1.5)\): its CPU load is low but its memory use is high, so it is anomalous. If the two features are treated as uncorrelated, the computed value \(p(x^{(1)})=p(x_1^{(1)};\mu_1,\sigma_1^2)\,p(x_2^{(1)};\mu_2,\sigma_2^2)\) is not particularly low, so the example may not be flagged as an anomaly. The magenta contours show how that model's probability is distributed over the plane. But CPU load and memory use are in fact (roughly linearly) correlated, so the true distribution has elliptical contours.

Multivariate Gaussian (Normal) distribution

\(x\in\mathbb{R}^n\). Don't model \(p(x_1),p(x_2),\dots\), etc. separately.

Model \(p(x)\) all in one go.

Parameters : \(\mu \in \mathbb{R}^n\), \(\Sigma \in \mathbb{R}^{n\times n}\) (covariance matrix)

\[p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \]

Parameter fitting:

Given training set \(\{x^{(1)},x^{(2)},\dots,x^{(m)}\}\)

\[\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)} \qquad \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T \]
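
A NumPy sketch of the parameter fit and the density formula above (`scipy.stats.multivariate_normal` offers an equivalent `pdf`; the function names here are illustrative):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mu and Sigma from the training set (maximum likelihood, 1/m)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]
    return mu, Sigma

def multivariate_p(X, mu, Sigma):
    """Evaluate p(x; mu, Sigma) for each row of X."""
    n = mu.shape[0]
    Xc = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = np.einsum('ij,jk,ik->i', Xc, inv, Xc)   # (x-mu)^T Sigma^{-1} (x-mu) per row
    return norm * np.exp(-0.5 * quad)
```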

Multivariate Gaussian (Normal) examples

(Figures: contours of the multivariate Gaussian for different choices of \(\mu\) and \(\Sigma\).)

Anomaly detection using the multivariate Gaussian distribution

Relationship to the original model

Original model : \(p(x) = p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma_2^2)\dots p(x_n;\mu_n,\sigma_n^2)\)

(Figure: axis-aligned contours of the original model.)

Corresponds to multivariate Gaussian

\[p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \]

Where \(\Sigma = \left[\begin{matrix}\sigma^2_1&0&\cdots&0\\0&\sigma^2_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\sigma^2_n\end{matrix}\right]\)

The contours of the original model are always axis-aligned; the features are modeled as uncorrelated.
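
A quick numerical check of this correspondence: with a diagonal \(\Sigma\), the multivariate density equals the product of the per-feature univariate densities. The test point below is arbitrary, and the variances reuse the earlier example \(x_1\sim N(5,4)\), \(x_2\sim N(3,1)\):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([5.0, 3.0])
sigma2 = np.array([4.0, 1.0])          # variances from the earlier example
x = np.array([6.0, 2.5])               # an arbitrary test point

joint = multivariate_normal(mean=mu, cov=np.diag(sigma2)).pdf(x)
per_feature = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

assert np.isclose(joint, per_feature)
```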

| Original model | Multivariate Gaussian |
| --- | --- |
| Manually create features to capture anomalies where \(x_1,x_2\) take unusual combinations of values, e.g. \(x_3=\frac{x_1}{x_2}=\frac{\text{CPU load}}{\text{memory}}\) | Automatically captures correlations between features |
| Computationally cheaper (alternatively, scales better to large \(n\)) | Computationally more expensive |
| OK even if \(m\) (training set size) is small | Must have \(m>n\), or else \(\Sigma\) is non-invertible (rule of thumb: \(m \geq 10n\)) |