Computer Science, 2024, Vol. 51, Issue (8): 20-33. doi: 10.11896/jsjkx.230600052
KONG Lingchao (孔翎超), LIU Guozhu (刘国柱)
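Abstract: Outlier detection is an important research direction in data mining. Its goal is to uncover data that deviate from the rest of a dataset yet carry potential analytical value, helping researchers identify possible problems in the data source. Outlier detection is now widely applied in fraud identification, smart healthcare, intrusion detection, fault diagnosis, and many other fields. Building on previous work, this paper first discusses the definition of outliers, their causes, and typical application areas; reviews the strengths and limitations of classical outlier detection algorithms such as DBSCAN and LOF and of their improved variants; and analyzes the advantages of deep learning methods for outlier detection. Second, in view of the demands of processing massive, high-dimensional, and time-series data in the current Internet setting, it further examines how outlier detection algorithms have developed in these new environments. Finally, it introduces evaluation metrics for outlier detection algorithms, the role of cost factors in evaluating outlier detection, and commonly used toolkits and datasets, and it summarizes the challenges facing outlier detection and the outlook for future research directions.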
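As a concrete illustration of the density-based family that the abstract names, the following is a minimal sketch of the LOF (Local Outlier Factor) score in plain Python. It follows the standard published definition (k-distance, reachability distance, local reachability density) and is not taken from the paper itself. The function name `lof_scores`, the parameter `k`, and the toy data are assumptions made for illustration only.

```python
import math


def lof_scores(points, k=3):
    """Return one LOF score per point; values well above 1 suggest an outlier.

    Illustrative sketch only. Assumes no location is shared by k+1 or more
    points (otherwise a local reachability density would be infinite).
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # Pairwise Euclidean distances (O(n^2), fine for small examples).
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []  # k-nearest neighbors of each point, ties included
    k_dist = []     # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]
        k_dist.append(kd)
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical toy data: a small cluster plus one isolated point.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
    for point, score in zip(data, lof_scores(data, k=3)):
        print(point, round(score, 3))
```

Points inside the cluster receive scores close to 1, while the isolated point receives a much larger score. In practice a threshold on the score, or a ranking of the top-scoring points, decides what is reported as an outlier. Library implementations avoid the quadratic distance matrix used here by relying on spatial indexes.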