Computer Science ›› 2022, Vol. 49 ›› Issue (1): 146-152.doi: 10.11896/jsjkx.201000156

• Database & Big Data & Data Science •

Locality and Consistency Based Sequential Ensemble Method for Outlier Detection

LIU Yi, MAO Ying-chi, CHENG Yang-kun, GAO Jian, WANG Long-bao   

  1. College of Computer and Information,Hohai University,Nanjing 211100,China
  2. Key Laboratory of Water Big Data Technology of Ministry of Water Resources,Nanjing 211100,China
  • Received:2020-10-27 Revised:2020-12-08 Online:2022-01-15 Published:2022-01-18
  • About author:LIU Yi,born in 1996,postgraduate.Her main research interests include distributed computing,IoT and edge intelligence computing.
    MAO Ying-chi,born in 1976,Ph.D,professor,is a senior member of China Computer Federation.Her main research interests include distributed computing and parallel processing,IoT,and edge intelligence computing.
  • Supported by:
    National Key Research Program of China(2018YFC0407105),Key Program of the National Natural Science Foundation of China(61832005) and Key Research Program of China Huaneng(HNKJ17-21).

Abstract: Outlier detection is widely used in many fields, such as network intrusion detection and credit card fraud detection. The increase in data dimensionality introduces many irrelevant and redundant features, which obscure the relevant features and lead to false positives. Because of the sparseness and distance concentration effects of high-dimensional data, traditional outlier detection algorithms based on density and distance are no longer applicable. Most machine-learning-based outlier detection research focuses on a single model, which offers limited resistance to overfitting. Ensemble learning models have good generalization ability and, in practice, achieve better prediction accuracy than single models. This paper proposes LCSE (locality and consistency based sequential ensemble method for outlier detection), a sequential ensemble method for outlier detection based on locality and consistency. First, it constructs diverse base outlier detection models; second, it selects outlier candidates according to global ensemble consistency; finally, it selects and combines the base model results by taking the local neighborhood correlation of the data into account. Experiments verify that LCSE improves outlier detection accuracy by 20.7% on average compared with traditional methods, and by 3.6% on average compared with the ensemble methods LSCP_AOM and iForest; it therefore outperforms other ensemble and neural network methods.
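The sketch below illustrates the three stages summarized in the abstract, assuming off-the-shelf scikit-learn detectors (isolation forests and LOF with varied parameters) as the diverse base models; the consistency score, the neighborhood size and the top-3 local selection rule are illustrative assumptions for this sketch, not the authors' exact design.

# Illustrative sketch of an LCSE-style pipeline (assumed detectors and
# parameters, not the authors' exact configuration).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def lcse_like_scores(X, n_candidates=20, local_k=10, top_m=3):
    # Stage 1: diverse base detectors (different families and hyper-parameters).
    scores = []
    for seed in (0, 1):
        iso = IsolationForest(n_estimators=100, random_state=seed).fit(X)
        scores.append(-iso.score_samples(X))            # higher = more outlying
    for k in (10, 20, 40):
        lof = LocalOutlierFactor(n_neighbors=k).fit(X)
        scores.append(-lof.negative_outlier_factor_)    # higher = more outlying
    Z = np.array(scores)
    Z = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, keepdims=True)

    # Stage 2: global ensemble consistency -- keep the points that the base
    # detectors consistently rank as outlying (high mean, low disagreement).
    consistency = Z.mean(axis=0) - Z.std(axis=0)
    candidates = np.argsort(consistency)[-n_candidates:]

    # Stage 3: local neighborhood correlation -- for each candidate, keep only
    # the detectors that agree best within its neighborhood and combine them.
    nn = NearestNeighbors(n_neighbors=local_k).fit(X)
    final = Z.mean(axis=0).copy()                        # default: plain averaging
    for i in candidates:
        neigh = nn.kneighbors(X[i:i + 1], return_distance=False)[0]
        local_mean = Z[:, neigh].mean(axis=0)
        corr = np.array([np.corrcoef(Z[d, neigh], local_mean)[0, 1]
                         for d in range(Z.shape[0])])
        best = np.argsort(corr)[-top_m:]
        final[i] = Z[best, i].mean()
    return final                                          # higher = more anomalous

For example, calling lcse_like_scores(X) on a NumPy feature matrix X returns one score per sample, and the samples with the largest scores are reported as outliers.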

Key words: Ensemble consistency, Ensemble diversity, High-dimensional data, Neighborhood correlation, Outlier detection

CLC Number: TP391.4
[1]AGGARWAL C C.Outlier analysis[C]//Data mining.Cham:Springer,2015:237-263.
[2]SCHUBERT E,WOJDANOWSKI R,ZIMEK A,et al.On evaluation of outlier rankings and outlier scores[C]//Proceedings of the 2012 SIAM International Conference on Data Mining.Philadelphia:SIAM,2012:1047-1058.
[3]CAMPOS G O,ZIMEK A,MEIRA W.An unsupervised boosting strategy for outlier detection ensembles[C]//Pacific-Asia Conference on Knowledge Discovery and Data Mining.Cham:Springer,2018:564-576.
[4]ZHAO Y,NASRULLAH Z,HRYNIEWICKI M K,et al.LSCP:Locally selective combination in parallel outlier ensembles[C]//Proceedings of the 2019 SIAM International Conference on Data Mining.Philadelphia:SIAM,2019:585-593.
[5]CHEN Y P,YU L,CHEN H.Traffic Anomaly Detection Based on Wavelet Neural Network and ARMA Model in Big Data Environment[J].Journal of Chongqing Institute of Technology(Natural Science),2019,33(10):149-154.
[6]CHEN J,SATHE S,AGGARWAL C,et al.Outlier detection with autoencoder ensembles[C]//Proceedings of the 2017 SIAM International Conference on Data Mining.Philadelphia:SIAM,2017:90-98.
[7]XING H J,HAO Z.Novelty Detection Method Based on Global and Local Discriminative Adversarial Autoencoder[J].Computer Science,2021,48(6):202-209.
[8]CHALAPATHY R,CHAWLA S.Deep learning for anomaly detection:A survey[J].arXiv:1901.03407,2019.
[9]LAZAREVIC A,KUMAR V.Feature bagging for outlier detection[C]//Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining.New York:ACM,2005:157-166.
[10]RAYANA S,AKOGLU L.Less is more:building selective anomaly ensembles[J].ACM TKDD,2016,10(4):1-33.
[11]LIU F T,TING K M,ZHOU Z H.Isolation forest[C]//2008 Eighth IEEE International Conference on Data Mining.Piscataway:IEEE,2008:413-422.
[12]RAYANA S,ZHONG W,AKOGLU L.Sequential ensemble learning for outlier detection:A bias-variance perspective[C]//2016 IEEE 16th International Conference on Data Mining (ICDM).Piscataway:IEEE,2016:1167-1172.
[13]GAO J,TAN P N.Converting output scores from outlier detection algorithms into probability estimates[C]//Sixth International Conference on Data Mining (ICDM'06).Piscataway:IEEE,2006:212-221.
[14]ZIMEK A,GAUDET M,CAMPELLO R J G B,et al.Subsampling for efficient and effective unsupervised outlier detection ensembles[C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,2013:428-436.
[15]ZIMEK A,CAMPELLO R J G B,SANDER J.Data perturbation for outlier detection ensembles[C]//Proceedings of the 26th International Conference on Scientific and Statistical Database Management.New York:ACM,2014:1-12.
[16]PASILLAS-DÍAZ J R,RATTÉ S.Bagged subspaces for unsupervised outlier detection[J].Computational Intelligence,2017,33(3):507-523.
[17]NGUYEN H V,ANG H H,GOPALKRISHNAN V.Mining outliers with ensemble of heterogeneous detectors on random subspaces[C]//International Conference on Database Systems for Advanced Applications.Berlin,Heidelberg:Springer,2010:368-383.
[18]CAMPOS G O,ZIMEK A,MEIRA W.An unsupervised boosting strategy for outlier detection ensembles[C]//Pacific-Asia Conference on Knowledge Discovery and Data Mining.Cham:Springer,2018:564-576.
[19]VAN STEIN B,VAN LEEUWEN M,BÄCK T.Local subspace-based outlier detection using global neighborhoods[C]//2016 IEEE International Conference on Big Data (Big Data).Piscataway:IEEE,2016:1136-1142.
[20]BREUNIG M M,KRIEGEL H P,NG R T,et al.LOF:identifying density-based local outliers[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.New York:ACM,2000:93-104.
[21]KRIEGEL H P,KRÖGER P,SCHUBERT E,et al.LoOP:local outlier probabilities[C]//Proceedings of the 18th ACM Conference on Information and Knowledge Management.New York:ACM,2009:1649-1652.
[22]RAYANA S.ODDS Library[DB/OL].http://odds.cs.stonybrook.edu,2016/2020-03-15.
[23]CAMPOS G O,ZIMEK A,SANDER J,et al.On the evaluation of unsupervised outlier detection:measures,datasets,and an empirical study[J].Data Mining and Knowledge Discovery,2016,30(4):891-927.