Time Series Anomaly Detection
Here is a 2015 survey article that gives a good overview of the techniques and where each one applies. https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/
Among them, the clustering techniques can use K-Means or a Gaussian Mixture Model (GMM). For GMM, see this excellent article: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb#scrollTo=2l9rOarpNSi0
There is also a more recent 2019 paper, DEEP LEARNING FOR ANOMALY DETECTION: A SURVEY https://arxiv.org/pdf/1901.03407.pdf, which surveys anomaly detection across all domains.
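As a rough illustration of the GMM idea mentioned above, the sketch below fits a mixture to synthetic 2-D data and treats the points with the lowest estimated density as anomalies. The data, the number of components, and the 1% cutoff are all assumptions made for the example, not values taken from the linked articles.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data (made up): two "normal" clusters plus three far-away points.
normal = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
])
outliers = np.array([[2.5, 2.5], [10.0, -3.0], [-4.0, 6.0]])
X = np.vstack([normal, outliers])

# Fit a two-component mixture; n_components=2 is an assumption that
# happens to match how the synthetic data was generated.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns the log-density of each point under the fitted mixture.
log_density = gmm.score_samples(X)

# Flag the lowest 1% of densities as anomalies (arbitrary cutoff).
threshold = np.percentile(log_density, 1)
is_anomaly = log_density < threshold

print("flagged indices:", np.flatnonzero(is_anomaly))
```

The cutoff is the main judgment call here: in practice the share of anomalies is usually unknown, so the percentile would need tuning against whatever labeled examples are available.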
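Types of anomalies:
Point Anomalies: a single point that differs from the rest, for example a sudden large expense.
Contextual Anomalies: a point that is anomalous only in a specific context, for example a sudden surge of traffic in the middle of the night, when activity is normally low.
Collective Anomalies: a group (a small batch) of data points that is anomalous as a whole. Each point looks normal on its own, but the group does not, for example:
- Events in unexpected order (ordered, e.g. breaking rhythm in ECG)
- Unexpected value combinations (unordered, e.g. buying a large number of expensive items)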
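To make the difference between point and contextual anomalies concrete, here is a minimal sketch on made-up hourly traffic. A global z-score catches the point anomaly, while a z-score computed per hour of day catches a value that is ordinary for daytime but unusual at 3 a.m. The traffic shape, the injected values, and the threshold of 3 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four weeks of hypothetical hourly traffic: busy at noon, quiet at night.
hours = np.arange(24 * 28)
hour_of_day = hours % 24
baseline = 100 + 80 * np.sin((hour_of_day - 6) / 24 * 2 * np.pi)
traffic = baseline + rng.normal(0, 5, size=hours.size)

# Inject one anomaly of each kind.
point_idx = 24 * 10 + 12    # noon on day 10
context_idx = 24 * 20 + 3   # 3 a.m. on day 20
traffic[point_idx] = 600    # far outside anything ever seen
traffic[context_idx] = 150  # normal for daytime, odd for 3 a.m.

def zscores(values):
    """Standard score of each value relative to the given sample."""
    return (values - values.mean()) / values.std()

# Global view: every point is compared with the whole series.
global_flags = np.abs(zscores(traffic)) > 3

# Contextual view: each point is compared only with the same hour of day.
contextual_z = np.empty_like(traffic)
for h in range(24):
    mask = hour_of_day == h
    contextual_z[mask] = zscores(traffic[mask])
contextual_flags = np.abs(contextual_z) > 3

print("global flags point anomaly:      ", global_flags[point_idx])
print("global flags 3 a.m. surge:       ", global_flags[context_idx])
print("contextual flags 3 a.m. surge:   ", contextual_flags[context_idx])
```

Collective anomalies are not covered by this sketch, since they need a model of sequences or value combinations rather than of individual points.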
unsupervised:
- Isolation Forest Algorithm
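- Judging from this notebook, it seems to do better than K-means? https://www.kaggle.com/rgaddati/unsupervised-fraud-detection-isolation-forest
- It is essentially statistical as well and does not take the time series into account. From [4], IF seems to perform slightly better than AutoEncoder. The test results in [7] also show that IF is very strong.
- For example, KNN is an option for small-to-medium, low-dimensional datasets, while Isolation Forest suits large, high-dimensional datasets. See [5].
- EIF, an upgraded version of IF: https://towardsdatascience.com/outlier-detection-with-extended-isolation-forest-1e248a3fe97b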
- https://towardsdatascience.com/anomaly-detection-with-isolation-forest-visualization-23cd75c281e2
- Local Outlier Factor(LOF) Algorithm
- Clustering: K-means
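- Clustering: GMM, unrelated to time order and purely statistical; somewhat more sophisticated than K-means
- Boxplot: very simple. As when drawing a boxplot, points that fall outside a set range are treated as anomalies
- AutoEncoder: training must use only normal data, so you need to know which data is normal; it is not purely unsupervised learning
- I have a feeling that graphs would work better; I plan to look into this
- Graphs Analytics for Fraud Detection (mentions that NRL, Network Representation Learning, is a relatively new technique that compresses sparse graphs into embeddings; the paper TitAnt: Online Real-time Transaction Fraud Detection in Ant Financial says Ant Financial uses this approach as well)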
- http://snap.stanford.edu/proj/embeddings-www/
- https://www.mdpi.com/2076-3417/9/19/4018/htm#B13-applsci-09-04018 这个跟我们场景应该很相似
- http://snap.stanford.edu/class/cs224w-2015/projects_2015/Anomaly_Detection_in_Graphs.pdf
- http://web.stanford.edu/class/cs224w/project/26424135.pdf
- https://www.andrew.cmu.edu/user/lakoglu/pubs/14-dami-graphanomalysurvey.pdf
- Anomaly Detection in Graphs and Time Series: Algorithms and Applications
- MIDAS https://towardsdatascience.com/anomaly-detection-in-dynamic-graphs-using-midas-e4f8d0b1db45 This seems similar to our scenario; the algorithm is based on CMS (count-min sketch)
- AD for big data
- https://medium.com/rahasak/anomaly-detection-with-isolation-forest-spark-scala-8d8b5f36c47c
- ref:
- https://www.kaggle.com/pavansanagapati/anomaly-detection-credit-card-fraud-analysis
- https://www.experoinc.com/post/fraud-detection-using-deep-learning-on-graph-embeddings-and-topology-metrics
- https://www.knime.com/blog/four-techniques-for-outlier-detection covers four anomaly detection algorithms and compares them (Numeric Outlier, Z-Score, DBSCAN, Isolation Forest)
- https://www.infoq.com/articles/fraud-detection-random-forest/ (covers Random Forest, AutoEncoder, Isolation Forest)
- What are the common "anomaly detection" algorithms in data mining? (数据挖掘中常见的「异常检测」算法有哪些?)
- Anomaly Detection Techniques in Python
- A comparative evaluation of outlier detection algorithms: experiments and analyses
- yzhao062/anomaly-detection-resources, the GitHub repo of an expert at CMU
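Three of the unsupervised detectors listed above (Isolation Forest, LOF, and the boxplot rule) can be tried side by side with a few lines of scikit-learn and NumPy. This is only a sketch on synthetic data: the dataset, the `contamination` value, the `n_neighbors` setting, and the helper name `boxplot_outliers` are assumptions for illustration, and results on real data may rank the methods differently.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Synthetic 2-D data (made up): one dense blob plus 10 isolated points on a ring.
inliers = rng.normal(0.0, 1.0, size=(500, 2))
angles = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
outliers = 7.0 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([inliers, outliers])

contamination = 0.02  # assumed share of anomalies; usually unknown in practice

# Both estimators return -1 for predicted outliers and 1 for inliers.
if_flags = IsolationForest(
    contamination=contamination, random_state=0
).fit_predict(X) == -1
lof_flags = LocalOutlierFactor(
    n_neighbors=20, contamination=contamination
).fit_predict(X) == -1

def boxplot_outliers(values, k=1.5):
    """Flag 1-D values outside the boxplot whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# The boxplot rule works on a single feature, so it gets its own 1-D sample.
amounts = np.concatenate([rng.normal(50.0, 5.0, size=200), [120.0, 5.0]])
box_flags = boxplot_outliers(amounts)

print("Isolation Forest caught", if_flags[-10:].sum(), "of 10 injected outliers")
print("LOF caught", lof_flags[-10:].sum(), "of 10 injected outliers")
print("Boxplot rule flagged the two injected amounts:", box_flags[-2:].all())
```

None of these three uses temporal order, which matches the note above that Isolation Forest and GMM are purely statistical; for time series they would need features that encode the context, such as hour of day or rolling statistics.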
system log anomaly detection:
- https://www.researchgate.net/publication/220925081_Anomaly_Detection_in_Computer_Security_and_an_Application_to_File_System_Accesses
- Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms, 2019
- Data-Driven Model-Based Detection of Malicious Insiders via Physical Access Logs, 2019
- Detecting insider information theft using features from file access logs, 2014. This is exactly my scenario
- Detection of Anomalous Insiders in Collaborative Environments via Relational Analysis of Access Logs, 2011. This seems even closer to my scenario
- A Review of Insider Threat Detection: Classification, Machine Learning Techniques, Datasets, Open Challenges, and Recommendations, 2020, a review article
- Data Stream Clustering for Real-time Anomaly Detection: An Application to Insider Threats, 2018
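- AN ABNORMAL FILE ACCESS BEHAVIOR DETECTION APPROACH BASED ON FILE PATH DIVERSITY, 2014, from the National University of Defense Technology. It proposes the FPD algorithm and also mentions the PAD algorithm, which I have not looked at yet
- Ghostbuster: A Fine-grained Approach for Anomaly Detection in File System Accesses, 2017. Works at the file block level and needs kernel support, so it does not fit my scenario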
Applications of RNN
https://github.com/chickenbestlover/RNN-Time-series-Anomaly-Detection
https://towardsdatascience.com/time-series-of-price-anomaly-detection-13586cd5ff46 Some commonly used clustering methods