Computer Science, 2021, Vol. 48, Issue (6A): 250-254. doi: 10.11896/jsjkx.200700102

• Big Data & Data Science •

Research on Ensemble Learning Method Based on Feature Selection for High-dimensional Data

ZHOU Gang, GUO Fu-liang

  1. Naval University of Engineering, Wuhan 430033, China
  • Online: 2021-06-10  Published: 2021-06-17
  • Corresponding author: ZHOU Gang (18672923066@163.com)
  • About author: ZHOU Gang, born in 1984, Ph.D., lecturer. His main research interests include big data technology and application.


Abstract: Prediction-error analysis and the bias-variance decomposition of ensemble learning show that ensembles built from a limited number of accurate and mutually diverse base learners achieve better generalization accuracy. This paper constructs a two-stage feature selection ensemble learning method based on information entropy. In the first stage, a base feature set B whose features have accuracy higher than 0.5 is constructed according to relative classification information entropy. In the second stage, independent feature subsets are grown from B by a greedy algorithm under a mutual-information independence criterion; the Jaccard coefficient is then used to evaluate the diversity among feature subsets, and diverse, independent subsets are selected to construct the base learners. Experiments show that the optimized method outperforms plain Bagging in both running efficiency and test accuracy, with the largest gains on multi-class high-dimensional datasets; it is not suitable for binary classification problems.
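The paper's implementation is not reproduced here. As an illustrative sketch only, the two-stage procedure described in the abstract might look as follows in Python; all thresholds, function names, and the use of plain mutual information as a stand-in for the paper's relative classification information entropy are assumptions, not the authors' method:

```python
import numpy as np

def mutual_information(a, b):
    """Mutual information (in bits) between two discrete-valued arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b
    nz = joint > 0                          # avoid log(0)
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def stage1_base_features(X, y, relevance_threshold=0.05):
    """Stage 1 (sketch): keep features sufficiently informative about the
    label; mutual information is used as a proxy for the paper's relative
    classification information entropy / accuracy > 0.5 criterion."""
    return [j for j in range(X.shape[1])
            if mutual_information(X[:, j], y) > relevance_threshold]

def stage2_subsets(X, B, n_subsets, indep_threshold=0.1):
    """Stage 2a (sketch): greedily grow feature subsets from B, admitting a
    feature only if it is nearly independent (low mutual information) of
    every feature already in the subset; rotating the starting feature
    gives each subset a different greedy trajectory."""
    subsets, m = [], len(B)
    for s in range(n_subsets):
        order = B[s % m:] + B[:s % m]
        subset = [order[0]]
        for j in order[1:]:
            if all(mutual_information(X[:, j], X[:, k]) < indep_threshold
                   for k in subset):
                subset.append(j)
        subsets.append(sorted(subset))
    return subsets

def jaccard(s1, s2):
    """Jaccard coefficient between two feature subsets."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def select_diverse(subsets, max_jaccard=0.5):
    """Stage 2b (sketch): keep only subsets that overlap little (low
    Jaccard coefficient) with every subset already kept."""
    kept = []
    for s in subsets:
        if all(jaccard(s, t) <= max_jaccard for t in kept):
            kept.append(s)
    return kept
```

One base learner (e.g. a decision tree) would then be trained on each surviving subset and the predictions combined by voting, as in Bagging.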

Key words: Diversity, Ensemble learning, Feature selection, High-dimensional data, Information entropy

CLC Number: TP181