Computer Science, 2022, Vol. 49, Issue (1): 146-152. doi: 10.11896/jsjkx.201000156

• Database & Big Data & Data Science •

Locality and Consistency Based Sequential Ensemble Method for Outlier Detection

LIU Yi, MAO Ying-chi, CHENG Yang-kun, GAO Jian, WANG Long-bao

  1. College of Computer and Information, Hohai University, Nanjing 211100, China
    Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Nanjing 211100, China
  • Received: 2020-10-27 Revised: 2020-12-08 Online: 2022-01-15 Published: 2022-01-18
  • Corresponding author: MAO Ying-chi (yingchimao@hhu.edu.cn)
  • About author: LIU Yi, born in 1996, postgraduate. Her main research interests include distributed computing, IoT and edge intelligence computing.
    MAO Ying-chi, born in 1976, Ph.D., professor, is a senior member of China Computer Federation. Her main research interests include distributed computing and parallel processing, IoT, and edge intelligence computing.
  • Author e-mail: 1175476508@qq.com
  • Supported by:
    National Key Research Program of China (2018YFC0407105), Key Program of the National Natural Science Foundation of China (61832005) and Key Research Program of China Huaneng (HNKJ17-21).

Abstract: Outlier detection is widely used in applications such as network intrusion detection and credit card fraud detection. As data dimensionality grows, many irrelevant and redundant features appear; they mask the relevant features and lead to false positives. Because high-dimensional data are sparse and pairwise distances concentrate, traditional density-based and distance-based outlier detection algorithms are no longer applicable. Most machine-learning research on outlier detection focuses on a single model, which has limited resistance to overfitting. Ensemble learning models generalize well and, in practice, achieve better prediction accuracy than single models. This paper proposes LCSE, a locality and consistency based sequential ensemble method for outlier detection. First, LCSE constructs base outlier detection models on the basis of diversity. Second, it selects outlier candidates according to global ensemble consistency. Finally, it takes the local neighborhood correlation of the data into account to select and combine the base model results. Experiments show that LCSE improves outlier detection accuracy by 20.7% on average over traditional methods and improves performance (AUC) by 3.6% on average over the ensemble methods LSCP_AOM and iForest, indicating that it outperforms the other ensemble methods and neural network methods.

Key words: Ensemble consistency, Ensemble diversity, High-dimensional data, Neighborhood correlation, Outlier detection

CLC Number: TP391.4
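
The three stages named in the abstract (diverse base models, candidate filtering by global ensemble consistency, and local selection and combination of base model results) can be illustrated with a minimal sketch. This is not the authors' LCSE implementation: the kNN-distance base detectors, the random feature subsets, the top-fraction voting rule, the Pearson-correlation selection, and every name and parameter (`lcse_sketch`, `top_frac`, `agree`, `keep`, and so on) are assumptions chosen only for illustration.

```python
import math
import random
import statistics


def knn_scores(X, feats, k):
    """Mean distance to the k nearest neighbours, measured on the feature subset `feats`."""
    proj = [[row[f] for f in feats] for row in X]
    scores = []
    for i, p in enumerate(proj):
        nearest = sorted(math.dist(p, q) for j, q in enumerate(proj) if j != i)[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def zscore(v):
    """Standardise scores so that different detectors are comparable."""
    mu, sd = statistics.fmean(v), statistics.pstdev(v)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in v]


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def lcse_sketch(X, n_models=8, top_frac=0.1, agree=0.5, n_neighbors=10, keep=3, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Stage 1 (diversity, assumed form): kNN detectors on random feature subsets, varied k.
    models = []
    for _ in range(n_models):
        feats = rng.sample(range(d), rng.randint(max(1, d // 2), d))
        models.append(zscore(knn_scores(X, feats, rng.choice([3, 5, 10]))))
    # Stage 2 (global consistency, assumed form): a point becomes a candidate when
    # at least `agree` of the detectors place it in their top `top_frac` scores.
    top_n = max(1, int(top_frac * n))
    votes = [0] * n
    for s in models:
        for i in sorted(range(n), key=s.__getitem__, reverse=True)[:top_n]:
            votes[i] += 1
    candidates = [i for i in range(n) if votes[i] >= agree * n_models]
    # Default score for every point: plain average over all detectors.
    avg = [statistics.fmean(s[i] for s in models) for i in range(n)]
    final = list(avg)
    # Stage 3 (local selection, assumed form): for each candidate, keep the detectors
    # whose scores on its neighbourhood correlate best with the ensemble average.
    for c in candidates:
        hood = sorted(range(n), key=lambda j: math.dist(X[c], X[j]))[:n_neighbors]
        target = [avg[j] for j in hood]
        ranked = sorted(models, key=lambda s: pearson([s[j] for j in hood], target), reverse=True)
        final[c] = statistics.fmean(s[c] for s in ranked[:keep])
    return final, candidates


# Toy data: 60 Gaussian inliers in 6 dimensions plus three planted outliers (indices 60-62).
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
X += [[8.0] * 6, [-8.0] * 6, [8.0, -8.0] * 3]
final, candidates = lcse_sketch(X)
top3 = sorted(range(len(X)), key=final.__getitem__, reverse=True)[:3]
print(sorted(top3), candidates)
```

On this toy data the planted points lie far from the Gaussian cloud in every feature subset, so they should pass the consistency filter and receive the highest combined scores. Real data would call for stronger base detectors and tuned thresholds.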