Computer Science ›› 2020, Vol. 47 ›› Issue (8): 185-188. doi: 10.11896/jsjkx.190600162
LIU Zhen-peng (刘振鹏) 1,2; SU Nan (苏楠) 1; QIN Yi-wen (秦益文) 3; LU Jia-huan (卢家欢) 1; LI Xiao-fei (李小菲) 2
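Abstract: In the big data era, attacks and tampering, equipment failures, and human falsification leave many outliers hidden in massive data sets. Accurately detecting these outliers so that the data can be cleaned is therefore essential. This paper proposes an outlier detection model based on Feature Segmentation and Cascaded Random Forest (FS-CRF). A sliding window and random forests segment the original features at a fine granularity and generate class probability vectors, which are then used to train a multi-layer cascade of random forests; the random forests in the last cascade layer vote to decide each sample's final class. Simulation results show that the new method achieves high F1-measure scores on outlier classification tasks across multiple UCI data sets; that the cascade structure further improves generalization over the classic random forest algorithm; and that on high-dimensional data sets the proposed method outperforms gradient boosting decision trees (GBDT) and XGBoost while having fewer hyperparameters and being easier to tune, giving better overall performance.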
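The segmentation and voting steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size, stride, the helper names, and the averaging rule in `final_vote` are assumptions made for illustration, and `toy_proba` is a hypothetical stand-in for a trained random forest's class-probability output.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def sliding_windows(features: Sequence[float], window: int, stride: int = 1) -> List[Vector]:
    """Cut one sample's feature vector into overlapping sub-vectors."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window > len(features):
        raise ValueError("window is longer than the feature vector")
    return [list(features[i:i + window])
            for i in range(0, len(features) - window + 1, stride)]


def class_vector(features: Sequence[float], window: int, stride: int,
                 proba_fn: Callable[[Sequence[float]], Vector]) -> Vector:
    """Concatenate the class probabilities predicted for every window."""
    out: Vector = []
    for sub in sliding_windows(features, window, stride):
        out.extend(proba_fn(sub))
    return out


def toy_proba(sub: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a forest: returns [P(normal), P(outlier)].

    It only counts values with magnitude above 3.0; it is NOT a real model.
    """
    p_outlier = sum(1 for v in sub if abs(v) > 3.0) / len(sub)
    return [1.0 - p_outlier, p_outlier]


def final_vote(forest_probas: Sequence[Sequence[float]]) -> int:
    """Average per-forest class probabilities (assumed rule) and pick the top class."""
    n_classes = len(forest_probas[0])
    mean = [sum(p[c] for p in forest_probas) / len(forest_probas)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


sample = [0.1, 0.2, 5.0, 0.3, 0.0]           # made-up feature vector
windows = sliding_windows(sample, window=3)   # three overlapping sub-vectors
vec = class_vector(sample, 3, 1, toy_proba)   # two probabilities per window
```

In a fuller version, `toy_proba` would be replaced by a fitted forest, and vectors such as `vec` would serve as the training input of the cascade layers.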