Three ways to detect outliers
Z-score
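    import numpy as np

    def outliers_z_score(ys):
        # Flag points that lie more than `threshold` standard deviations from the mean.
        threshold = 3
        ys = np.asarray(ys, dtype=float)
        mean_y = np.mean(ys)
        stdev_y = np.std(ys)  # population standard deviation (ddof=0)
        z_scores = (ys - mean_y) / stdev_y
        return np.where(np.abs(z_scores) > threshold)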
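As a rough usage sketch of the Z-score approach, the snippet below runs a self-contained variant on made-up data. The function name `z_score_outlier_indices`, the sample values, and the choice to flag nothing when the standard deviation is zero are all assumptions for illustration. It also shows a known limitation: with the population standard deviation that `np.std` computes by default, no point in a sample of n values can have an absolute Z-score above sqrt(n - 1), so very small samples never reach a threshold of 3.

```python
import numpy as np

def z_score_outlier_indices(ys, threshold=3.0):
    """Return indices of points whose absolute Z-score exceeds `threshold`."""
    ys = np.asarray(ys, dtype=float)
    stdev = np.std(ys)  # population standard deviation (ddof=0), NumPy's default
    if stdev == 0:
        # All values identical: Z-scores are undefined, so flag nothing (an assumption).
        return np.array([], dtype=int)
    z_scores = (ys - np.mean(ys)) / stdev
    return np.flatnonzero(np.abs(z_scores) > threshold)

# Made-up data: twenty readings alternating between 9 and 11, plus one extreme value.
sample = [9, 11] * 10 + [30]
flagged = z_score_outlier_indices(sample)
print(flagged, [sample[i] for i in flagged])

# With only six points, no Z-score can exceed sqrt(6 - 1), about 2.24,
# so the same extreme value is not flagged at a threshold of 3.
small_sample = [9, 11, 9, 11, 10, 30]
print(z_score_outlier_indices(small_sample))
```

On the larger sample only the value 30 is flagged; on the six-point sample nothing is, which is one reason to consider a more robust method for small datasets.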
Modified Z-score
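    import numpy as np

    def outliers_modified_z_score(ys):
        # Uses the median and the median absolute deviation (MAD) instead of
        # the mean and standard deviation, so extreme values distort it less.
        # Note: if more than half of the values are identical, the MAD is 0
        # and the division below is undefined.
        threshold = 3.5
        ys = np.asarray(ys, dtype=float)
        median_y = np.median(ys)
        median_absolute_deviation_y = np.median(np.abs(ys - median_y))
        modified_z_scores = 0.6745 * (ys - median_y) / median_absolute_deviation_y
        return np.where(np.abs(modified_z_scores) > threshold)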
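A minimal sketch of how the modified Z-score behaves on a tiny made-up sample follows. The function name `modified_z_score_outlier_indices`, the data, and the decision to flag nothing when the MAD is zero are assumptions for illustration; a caller might reasonably prefer a different fallback in that case.

```python
import numpy as np

def modified_z_score_outlier_indices(ys, threshold=3.5):
    """Return indices of points whose absolute modified Z-score exceeds `threshold`."""
    ys = np.asarray(ys, dtype=float)
    median = np.median(ys)
    mad = np.median(np.abs(ys - median))  # median absolute deviation
    if mad == 0:
        # More than half of the values equal the median: the score is undefined.
        # Flagging nothing here is an assumption, not part of the method itself.
        return np.array([], dtype=int)
    # 0.6745 is approximately the 75th percentile of the standard normal
    # distribution; it puts the MAD on the same scale as a standard deviation.
    modified_z = 0.6745 * (ys - median) / mad
    return np.flatnonzero(np.abs(modified_z) > threshold)

# Made-up six-point sample with one extreme value at index 5.
small_sample = [9, 11, 9, 11, 10, 30]
flagged = modified_z_score_outlier_indices(small_sample)
print(flagged, [small_sample[i] for i in flagged])

# Edge case: the MAD is zero because most values are tied.
tied = [10, 10, 10, 10, 50]
print(modified_z_score_outlier_indices(tied))
```

The extreme value in the six-point sample is flagged because the median and MAD are barely moved by it. The tied sample shows why the zero-MAD case deserves an explicit decision rather than a silent division by zero.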
IQR (interquartile range)
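    import numpy as np

    def outliers_iqr(ys):
        # Flag points more than 1.5 * IQR below the first quartile
        # or above the third quartile.
        ys = np.asarray(ys)  # a plain list cannot be compared element-wise
        quartile_1, quartile_3 = np.percentile(ys, [25, 75])
        iqr = quartile_3 - quartile_1
        lower_bound = quartile_1 - (iqr * 1.5)
        upper_bound = quartile_3 + (iqr * 1.5)
        return np.where((ys > upper_bound) | (ys < lower_bound))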
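The IQR rule can be tried out with the sketch below, which also returns the two bounds so they can be inspected. The function name `iqr_outlier_indices`, the `k` parameter, and the sample values are assumptions for illustration; the quartiles come from `np.percentile` with its default linear interpolation.

```python
import numpy as np

def iqr_outlier_indices(ys, k=1.5):
    """Return (indices, (lower_bound, upper_bound)) for the k * IQR rule."""
    ys = np.asarray(ys, dtype=float)  # allows plain Python lists as input
    quartile_1, quartile_3 = np.percentile(ys, [25, 75])
    iqr = quartile_3 - quartile_1
    lower_bound = quartile_1 - k * iqr
    upper_bound = quartile_3 + k * iqr
    indices = np.flatnonzero((ys < lower_bound) | (ys > upper_bound))
    return indices, (float(lower_bound), float(upper_bound))

# Made-up unsorted sample with one low and one high extreme value.
sample = [23, 60, 21, 25, 2, 27, 22, 26, 24]
indices, bounds = iqr_outlier_indices(sample)
print(bounds)
print(indices, [sample[i] for i in indices])
```

The returned indices refer to positions in the original, unsorted input, so the flagged points can be looked up directly for a closer look.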
Conclusion
It is important to reiterate that these methods should not be used mechanically. They should be used to explore the data – they let you know which points might be worth a closer look. What to do with this information depends heavily on the situation. Sometimes it is appropriate to exclude outliers from a dataset to make a model trained on that dataset more predictive. Sometimes, however, the presence of outliers is a warning sign that the real-world process generating the data is more complicated than expected. As an astute commenter on CrossValidated put it: “Sometimes outliers are bad data, and should be excluded, such as typos. Sometimes they are Wayne Gretzky or Michael Jordan, and should be kept.” Domain knowledge and practical wisdom are the only ways to tell the difference.
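Excerpted from: http://colingorrie.github.io/outlier-detection.html
Author: Standby. A lifelong love of famous mountains and great rivers, grasslands and deserts, and our little Guo baby!
Source: http://www.cnblogs.com/standby/
Copyright in this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must appear in a prominent position on the page; otherwise the author reserves the right to pursue legal liability.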