Computer Science ›› 2024, Vol. 51 ›› Issue (8): 20-33. doi: 10.11896/jsjkx.230600052

• Database & Big Data & Data Science •

Review of Outlier Detection Algorithms

KONG Lingchao, LIU Guozhu

  1. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061, China
  • Received: 2023-06-06 Revised: 2023-12-13 Online: 2024-08-15 Published: 2024-08-13
  • Corresponding author: LIU Guozhu (lgz_0228@163.com)
  • About author: KONG Lingchao (kk1392567492@163.com), born in 1998, postgraduate, is a student member of CCF (No. G8696G). His main research interests include data mining and fault detection.
    LIU Guozhu, born in 1965, Ph.D., professor, master supervisor. His main research interests include network security and fault detection.
  • Supported by:
    National Natural Science Foundation of China (61973180).

Abstract: Outlier detection, an important research direction in data mining, aims to discover data points that differ from the majority of a dataset and have potential analytical value, helping researchers identify possible problems in the data source. Outlier detection has been widely applied in domains such as fraud detection, smart healthcare, intrusion detection, and fault diagnosis. Building on previous work, this paper first discusses the definition of outliers, their causes, and typical application domains, reviews the advantages and limitations of classical outlier detection algorithms such as DBSCAN and LOF and of their improved variants, and analyzes the advantages of deep learning methods for outlier detection. Second, in view of the need to process massive, high-dimensional, and time-series data in the current Internet context, it further examines how outlier detection algorithms have developed in these new environments. Finally, it introduces the evaluation metrics for outlier detection algorithms, the role of cost factors in evaluating outlier detection, and commonly used toolkits and datasets, and it summarizes the challenges facing outlier detection and its future development directions.

Key words: Outliers, Anomaly detection, Deep learning, Time-series data, Data mining

CLC Number:

  • TP301