Computer Science ›› 2020, Vol. 47 ›› Issue (8): 185-188.doi: 10.11896/jsjkx.190600162


FS-CRF:Outlier Detection Model Based on Feature Segmentation and Cascaded Random Forest

LIU Zhen-peng1, 2, SU Nan1, QIN Yi-wen3, LU Jia-huan1, LI Xiao-fei2   

1 School of Cyber Security and Computer, Hebei University, Baoding, Hebei 071002, China
    2 Information Technology Center, Hebei University, Baoding, Hebei 071002, China
    3 School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Online: 2020-08-15  Published: 2020-08-10
  • About author: LIU Zhen-peng, born in 1966, Ph.D., professor, is a senior member of China Computer Federation. His main research interests include network information security and outlier detection.
    LI Xiao-fei, born in 1979, master, engineer. Her main research interests include network information security and outlier detection.
  • Supported by:
    This work was supported by the Natural Science Foundation of Hebei Province, China (F2019201427) and the Ministry of Education Fund for “Integration of Cloud Computing and Big Data, Innovation of Science and Education”, China (2017A20004).

Abstract: In the era of big data, massive data sets often contain abnormal values caused by attack tampering, equipment failure, artificial fraud and other factors, and accurately detecting these outliers is critical to data cleaning. Therefore, an outlier detection model combining feature segmentation with a multi-level cascaded random forest (FS-CRF) is proposed. A sliding window and random forests are used to segment the original features, and the resulting class probability vectors are used to train the multi-level cascaded random forest; the category of a sample is finally determined by the vote of the last layer. Simulation results on UCI data sets show that the new method effectively detects outliers in classification tasks and obtains high F1-measure scores on both high-dimensional and low-dimensional data sets. Compared with the classical random forest, the cascade structure further improves the generalization ability of the model. Compared with GBDT and XGBoost, the proposed method has a performance advantage on high-dimensional data sets, has fewer hyper-parameters that are easy to tune, and achieves better overall performance.
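The pipeline described in the abstract resembles a deep-forest construction built from off-the-shelf random forests: sliding-window feature segmentation produces class probability vectors, which then flow through cascaded forest layers whose last layer votes. The sketch below illustrates these two stages in Python with scikit-learn; the window size, step, layer count, forest sizes and helper names are illustrative assumptions, not the settings or code of the paper.

```python
# Minimal sketch of the FS-CRF idea (feature segmentation + cascaded random
# forests) under assumed hyper-parameters; not the authors' implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def segment_features(X_train, y_train, X_test, window=4, step=2):
    """Slide a window over the feature axis; each window trains a small
    random forest whose class-probability outputs become new features."""
    train_parts, test_parts = [], []
    n_features = X_train.shape[1]
    for start in range(0, max(n_features - window, 0) + 1, step):
        cols = slice(start, start + window)
        rf = RandomForestClassifier(n_estimators=50, random_state=start)
        rf.fit(X_train[:, cols], y_train)
        train_parts.append(rf.predict_proba(X_train[:, cols]))
        test_parts.append(rf.predict_proba(X_test[:, cols]))
    return np.hstack(train_parts), np.hstack(test_parts)

def cascade_predict(Z_train, y_train, Z_test, n_layers=3, forests_per_layer=2):
    """Each cascade layer appends its forests' class probabilities to the
    layer input; the forests of the last layer vote on the final label."""
    cur_train, cur_test = Z_train, Z_test
    for layer in range(n_layers):
        layer_train, layer_test = [], []
        for f in range(forests_per_layer):
            rf = RandomForestClassifier(n_estimators=100,
                                        random_state=layer * 10 + f)
            rf.fit(cur_train, y_train)
            layer_train.append(rf.predict_proba(cur_train))
            layer_test.append(rf.predict_proba(cur_test))
        if layer == n_layers - 1:
            # Final layer: average the probability vectors (a soft vote);
            # the result indexes into rf.classes_.
            return np.mean(layer_test, axis=0).argmax(axis=1)
        # Augment the original probability vectors with this layer's outputs.
        cur_train = np.hstack([Z_train] + layer_train)
        cur_test = np.hstack([Z_test] + layer_test)

# Usage sketch: Z_tr, Z_te = segment_features(X_tr, y_tr, X_te)
#               y_pred = cascade_predict(Z_tr, y_tr, Z_te)
```

One caveat: feeding training-set probabilities straight back into the cascade, as in this sketch, tends to overfit; deep-forest-style cascades normally use out-of-fold (cross-validated) probability vectors at every level.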

Key words: Cascade random forest, Data cleaning, Ensemble learning, Grained feature, Outlier detection

CLC Number: 

  • TP301
[1]AHMED M, MAHMOOD A N, ISLAM M R.A survey of anomaly detection techniques in financial domain[J].Future Generation Computer Systems, 2016, 55(6):278-288.
[2]DJENOURI Y, ZIMEK A.Outlier detection in urban traffic data[C]∥Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics.ACM, 2018:1-12.
[3]DOMINGUES R, FILIPPONE M, MICHIARDI P, et al.A comparative evaluation of outlier detection algorithms:Experiments and analyses[J].Pattern Recognition, 2018, 74:406-421.
[4]WANG H, BAH M J, HAMMAD M.Progress in Outlier Detection Techniques:A Survey[J].IEEE Access, 2019, 7:107964-108000.
[5]GUO K, LIU D, PENG Y, et al.Data-Driven Anomaly Detection Using OCSVM with Boundary Optimization[C]∥2018 Prognostics and System Health Management Conference.IEEE, 2018:244-248.
[6]BREUNIG M M, KRIEGEL H P, NG R T, et al.LOF:identifying density-based local outliers[J].ACM SIGMOD Record, 2000, 29(2):93-104.
[7]RAMASWAMY S, RASTOGI R, SHIM K.Efficient algorithms for mining outliers from large data sets[J].ACM SIGMOD Record, 2000, 29(2):427-438.
[8]LIU Y, LI Z, ZHOU C, et al.Generative adversarial active learning for unsupervised outlier detection[J].arXiv:1809.10816.
[9]CHEN J, SATHE S, AGGARWAL C, et al.Outlier detection with autoencoder ensembles[C]∥Proceedings of the 2017 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics, 2017:90-98.
[10]LIU F T, TING K M, ZHOU Z H.Isolation-based anomaly detection[J].ACM Transactions on Knowledge Discovery from Data (TKDD), 2012, 6(1):1-39.
[11]FRIEDMAN J H.Greedy function approximation:a gradient boosting machine[J].Annals of Statistics, 2001, 29(5):1189-1232.
[12]CHEN T, GUESTRIN C.Xgboost:A scalable tree boosting system[C]∥Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining.ACM, 2016:785-794.
[13]GONG Z H, WANG J N, SU C.A Weighted Deep Forest Algorithm[J].Computer Applications and Software, 2019, 36(2):274-278.
[14]DUA D, GRAFF C.UCI Machine Learning Repository[EB/OL].http://archive.ics.uci.edu/ml.
[15]BREIMAN L.Random forests[J].Machine learning, 2001, 45(1):5-32.