Computer Science ›› 2021, Vol. 48 ›› Issue (6A): 250-254. doi: 10.11896/jsjkx.200700102

• Big Data & Data Science •

Research on Ensemble Learning Method Based on Feature Selection for High-dimensional Data

ZHOU Gang, GUO Fu-liang   

  1. Naval University of Engineering, Wuhan 430033, China
  • Online: 2021-06-10  Published: 2021-06-17
  • About author: ZHOU Gang, born in 1984, Ph.D, lecturer. His main research interests include big data technology and its applications.

Abstract: Prediction error analysis and the bias-variance decomposition of ensemble learning show that an ensemble built from a limited number of accurate and mutually diverse base learners achieves better generalization accuracy. This paper constructs a two-stage feature selection ensemble learning method based on information entropy. In the first stage, a base feature set B is built from the features whose individual classification accuracy, measured by relative classification information entropy, exceeds 0.5. In the second stage, independent feature subsets are grown from B by a greedy algorithm under a mutual information criterion. The Jaccard coefficient is then used to measure the diversity among these feature subsets; the diverse independent subsets are retained and one base learner is trained on each. Experiments show that the proposed method outperforms standard Bagging in both efficiency and accuracy, with particularly good results on multi-class high-dimensional datasets, although it is not well suited to binary classification problems.
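To make the pipeline concrete, the following Python sketch outlines the two stages and the diversity filter, assuming scikit-learn is available. It is an illustration only: single-feature cross-validated accuracy stands in for the paper's relative classification information entropy criterion, the greedy step ranks candidates only by mutual information with the label (the authors' criterion may also penalize redundancy among already-chosen features), and all helper names and the decision-tree base learner are hypothetical choices rather than the authors' implementation.

# Illustrative sketch of the two-stage feature-selection ensemble; see the
# caveats above. Assumes scikit-learn and non-negative integer class labels.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def jaccard(a, b):
    # Jaccard coefficient between two feature-index sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stage1_base_set(X, y, threshold=0.5):
    # Stage 1: keep features whose single-feature classifier beats the
    # threshold (a stand-in for the paper's relative classification
    # information entropy filter).
    keep = []
    for j in range(X.shape[1]):
        acc = cross_val_score(DecisionTreeClassifier(max_depth=1),
                              X[:, [j]], y, cv=3).mean()
        if acc > threshold:
            keep.append(j)
    return keep

def stage2_greedy_subsets(X, y, base, n_subsets=10, subset_size=5, seed=0):
    # Stage 2: greedily grow feature subsets from B, ranked by mutual
    # information with the label; random seed features differentiate subsets.
    rng = np.random.default_rng(seed)
    mi = mutual_info_classif(X[:, base], y)
    order = np.argsort(mi)[::-1]
    subsets = []
    for _ in range(n_subsets):
        chosen = [base[rng.integers(len(base))]]   # random starting feature
        for idx in order:                          # add top-MI features greedily
            f = base[int(idx)]
            if f not in chosen:
                chosen.append(f)
            if len(chosen) == subset_size:
                break
        subsets.append(chosen)
    return subsets

def select_diverse(subsets, max_jaccard=0.6):
    # Keep only subsets whose pairwise Jaccard overlap stays below the cap.
    kept = []
    for s in subsets:
        if all(jaccard(s, t) < max_jaccard for t in kept):
            kept.append(s)
    return kept

def fit_ensemble(X, y, subsets):
    # One base learner per diverse feature subset.
    return [(s, DecisionTreeClassifier().fit(X[:, s], y)) for s in subsets]

def predict(ensemble, X):
    # Majority vote over the base learners (labels assumed to be 0..k-1).
    votes = np.array([clf.predict(X[:, s]) for s, clf in ensemble])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)

Capping the pairwise Jaccard coefficient is what enforces the diversity that the bias-variance argument relies on: subsets with little feature overlap tend to produce base learners with decorrelated errors, which is where the ensemble's accuracy gain comes from.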

Key words: Diversity, Ensemble learning, Feature selection, High-dimensional data, Information entropy

CLC Number: TP181