
Outlier Detection

2013-08-23 11:27  Loull

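1) For normally distributed data, values that fall outside the central 95% may be outliers. Standardize the variable var (z-score); values with |var| > 1.96 may be outliers. Further check needed! A large sample is better.
For skewed data (check with a histogram), this method does not seem to work well.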
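
A minimal sketch of the 1.96 rule above, using only the Python standard library. The function name zscore_outliers and the sample data are made up for illustration, and the sample standard deviation is an assumption (the note does not say which one to use).

```python
import statistics

def zscore_outliers(values, threshold=1.96):
    """Return (value, z) pairs whose |z| exceeds the threshold.

    Hypothetical helper: standardizes with the sample mean and the
    sample standard deviation (n - 1 in the denominator).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all values identical, nothing can be flagged
    flagged = []
    for v in values:
        z = (v - mean) / sd
        if abs(z) > threshold:
            flagged.append((v, z))
    return flagged

# Illustrative data only: eleven readings near 10 and one suspicious 15.0.
data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0, 10.1, 9.9, 10.0, 15.0]
flagged = zscore_outliers(data)
print(flagged)
```

Flagged values are only candidates. The outlier itself inflates the mean and the standard deviation, which is one reason the note asks for a further check and a large sample.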

 

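2) Boxplot Method
Robust, no normality assumption.
The boxplot criterion for outliers is based on the quartiles and the interquartile range.
Interquartile range (QR, Quartile range): the distance between the upper and lower quartiles, i.e. the upper quartile minus the lower quartile.
F denotes the median, QR denotes the interquartile range.
Draw two segments, parallel to the median line, at Q3+1.5QR and Q1-1.5QR. These two segments are the outlier cutoffs, called the inner fences.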
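Draw two more segments at Q3+3QR and Q1-3QR, called the outer fences.
Points between the inner and outer fences are mild outliers (Mild Outliers); points beyond the outer fences are extreme outliers (Extreme Outliers).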

http://blog.sina.com.cn/s/blog_7dc56e6e0100qzra.html
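
The fences can be sketched as below, assuming the usual boxplot convention of measuring both inner and outer fences from Q1 and Q3. The helper names boxplot_fences and classify are hypothetical, the data is made up, and method="inclusive" is just one of several quartile conventions, so fence values can differ slightly between tools.

```python
import statistics

def boxplot_fences(values):
    """Return the inner and outer fences as (low, high) tuples."""
    # statistics.quantiles with n=4 returns [Q1, median, Q3].
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    qr = q3 - q1  # interquartile range, written QR in the notes above
    return {
        "inner": (q1 - 1.5 * qr, q3 + 1.5 * qr),
        "outer": (q1 - 3 * qr, q3 + 3 * qr),
    }

def classify(values):
    """Split values into (mild, extreme) outliers using the fences."""
    fences = boxplot_fences(values)
    in_lo, in_hi = fences["inner"]
    out_lo, out_hi = fences["outer"]
    mild, extreme = [], []
    for v in values:
        if v < out_lo or v > out_hi:
            extreme.append(v)   # beyond the outer fences
        elif v < in_lo or v > in_hi:
            mild.append(v)      # between inner and outer fences
    return mild, extreme

# Illustrative data only.
data = [2, 4, 4, 5, 5, 6, 6, 7, 8, 14, 30]
fences = boxplot_fences(data)
mild, extreme = classify(data)
print(fences, mild, extreme)
```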

 

3) Grubbs' test and Dixon's test

Grubbs' test for outliers
Assumes the data are normally distributed.
Sample size should be greater than 6.
Also known as the maximum normed residual test.

http://en.wikipedia.org/wiki/Grubbs'_test_for_outliers
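
A rough sketch of the test statistic only: G is the largest absolute deviation from the sample mean divided by the sample standard deviation. The function name grubbs_statistic and the data are hypothetical. The critical value is not computed here because it needs a t-distribution quantile; look it up in a table or compute it with a statistics package and compare it with G yourself.

```python
import statistics

def grubbs_statistic(values):
    """Return (G, index), where G = max |x_i - mean| / s.

    Hypothetical helper. It refuses samples of six or fewer points,
    following the sample-size note above.
    """
    n = len(values)
    if n <= 6:
        raise ValueError("Grubbs' test needs more than 6 observations")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("all values are identical")
    # Index of the observation farthest from the mean (the suspect point).
    idx = max(range(n), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx

# Illustrative data only: the last reading looks suspicious.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 14.0]
g, idx = grubbs_statistic(sample)
print(g, sample[idx])
# The suspect point is rejected only if g exceeds the tabulated
# critical value for this n and the chosen significance level.
```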

 

Dixon's Q test
Should be used only once in a data set.
Arrange the data in order of increasing values and calculate Q as defined: Q = gap / range, where gap is the absolute difference between the outlier in question and the closest number to it, and range is the difference between the largest and smallest values. If calculated Q > table Q, then reject the questionable point.
http://en.wikipedia.org/wiki/Dixon's_Q_test
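
The Q calculation above can be sketched as follows. The names dixon_q and reject_suspect are hypothetical, and q_table must come from a published Q table for your sample size and confidence level. The 0.412 used below is the value commonly tabulated for n = 10 at 90% confidence, so check it against your own table before relying on it.

```python
def dixon_q(values):
    """Return (q_low, q_high): Q for the smallest and the largest value."""
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least 3 observations")
    spread = data[-1] - data[0]  # range = max - min
    if spread == 0:
        raise ValueError("all values are identical")
    q_low = (data[1] - data[0]) / spread     # gap next to the minimum
    q_high = (data[-1] - data[-2]) / spread  # gap next to the maximum
    return q_low, q_high

def reject_suspect(values, q_table):
    """Return the single rejected value, or None if nothing is rejected.

    Only the more extreme end is tested, since the test is meant to be
    applied once per data set.
    """
    q_low, q_high = dixon_q(values)
    if max(q_low, q_high) <= q_table:
        return None
    return min(values) if q_low > q_high else max(values)

# Example data from the Wikipedia article linked above.
data = [0.189, 0.167, 0.187, 0.183, 0.186, 0.182, 0.181, 0.184, 0.181, 0.177]
q_low, q_high = dixon_q(data)
rejected = reject_suspect(data, q_table=0.412)
print(q_low, q_high, rejected)
```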