Computer Science (计算机科学), 2020, Vol. 47, Issue (8): 185-188. doi: 10.11896/jsjkx.190600162

• Database & Big Data & Data Science •

FS-CRF: Outlier Detection Model Based on Feature Segmentation and Cascaded Random Forest

LIU Zhen-peng1, 2, SU Nan1, QIN Yi-wen3, LU Jia-huan1, LI Xiao-fei2

    1 School of Cyber Security and Computer, Hebei University, Baoding, Hebei 071002, China
    2 Information Technology Center, Hebei University, Baoding, Hebei 071002, China
    3 School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Online: 2020-08-15 Published: 2020-08-10
  • Corresponding author: LI Xiao-fei (lixiaofei@hbu.edu.cn)
  • About author: lzp@hbu.edu.cn
    LIU Zhen-peng, born in 1966, Ph.D., professor, is a senior member of China Computer Federation. His main research interests include network information security and outlier detection.
    LI Xiao-fei, born in 1979, master's degree, engineer. Her main research interests include network information security and outlier detection.
  • Supported by:
    This work was supported by the Natural Science Foundation of Hebei Province, China (F2019201427) and the Ministry of Education Fund for “Integration of Cloud Computing and Big Data, Innovation of Science and Education”, China (2017A20004).

Abstract: In the era of big data, attacks and tampering, equipment failure, human fraud and other causes leave many abnormal values hidden in massive data. Accurately detecting outliers in data is critical to data cleaning. This paper proposes an outlier detection model that combines feature segmentation with a multi-level cascaded random forest (FS-CRF). A sliding window and random forests segment the original features at a fine granularity and generate class probability vectors, which are used to train the multi-level cascaded random forest; the final class of a sample is decided by the vote of the random forest in the last cascade layer. Simulation results show that the new method achieves high F1-measure scores in outlier classification tasks on multiple UCI data sets, both high- and low-dimensional. The cascade structure further improves the generalization ability of the model compared with the classical random forest. On high-dimensional data sets, the proposed method outperforms the gradient boosting decision tree (GBDT) and XGBoost; it also has fewer hyper-parameters, is easy to tune, and shows better overall performance.

Key words: Cascaded random forest, Data cleaning, Ensemble learning, Fine-grained feature, Outlier detection
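The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no window size, stride, layer count, or forest size, so every such value below is an assumption, as are the helper names (`window_slices`, `fit_segment_forests`, `transform_segments`, `fit_cascade`, `predict_cascade`), the use of scikit-learn, the synthetic imbalanced data standing in for a UCI data set, and the choices to use out-of-fold probabilities and to pass the segment features to every cascade layer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split


def window_slices(n_features, width, stride):
    """Column slices visited by a sliding window over the feature vector."""
    return [slice(s, s + width) for s in range(0, n_features - width + 1, stride)]


def fit_segment_forests(X, y, slices, seed=0):
    """Fit one forest per window (assumed design) and return the forests plus
    the out-of-fold class probability vectors, concatenated across windows."""
    forests, probas = [], []
    for i, sl in enumerate(slices):
        rf = RandomForestClassifier(n_estimators=30, random_state=seed + i)
        # Out-of-fold probabilities, so later stages are not trained on
        # predictions made for samples the forest has already memorised.
        probas.append(cross_val_predict(rf, X[:, sl], y, cv=3, method="predict_proba"))
        forests.append(rf.fit(X[:, sl], y))
    return forests, np.hstack(probas)


def transform_segments(forests, slices, X):
    """Map new samples to concatenated per-window class probability vectors."""
    return np.hstack([rf.predict_proba(X[:, sl]) for rf, sl in zip(forests, slices)])


def fit_cascade(F, y, n_layers=3, seed=100):
    """Fit a cascade: layer 0 sees F, each later layer sees F plus the class
    probabilities produced by the previous layer (assumed wiring)."""
    layers, layer_in = [], F
    for k in range(n_layers):
        rf = RandomForestClassifier(n_estimators=50, random_state=seed + k)
        oof = cross_val_predict(rf, layer_in, y, cv=3, method="predict_proba")
        layers.append(rf.fit(layer_in, y))
        layer_in = np.hstack([F, oof])
    return layers


def predict_cascade(layers, F):
    """Propagate samples through the cascade; the last forest decides the class.
    scikit-learn forests average tree probabilities, a soft form of voting."""
    layer_in = F
    for rf in layers[:-1]:
        layer_in = np.hstack([F, rf.predict_proba(layer_in)])
    return layers[-1].predict(layer_in)


# Synthetic imbalanced data: class 1 plays the role of the outlier class.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

slices = window_slices(X.shape[1], width=5, stride=5)
forests, F_train = fit_segment_forests(X_train, y_train, slices)
layers = fit_cascade(F_train, y_train)

y_pred = predict_cascade(layers, transform_segments(forests, slices, X_test))
score = f1_score(y_test, y_pred)  # F1-measure of the outlier class
print(f"F1-measure on held-out data: {score:.3f}")
```

With 20 features, a window of width 5 and stride 5 gives four segments, so each sample is re-encoded as an 8-dimensional vector (four windows times two class probabilities) before it enters the cascade. The printed score depends on the synthetic data and says nothing about the results reported in the paper.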

CLC Number:

  • TP301