Computer Science, 2019, Vol. 46, Issue (5): 203-208. doi: 10.11896/j.issn.1002-137X.2019.05.031
CAO Ya-xi (曹雅茜), HUANG Hai-yan (黄海燕)
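Abstract: Owing to its strong generalization ability, ensemble learning is widely used in class-imbalanced scenarios such as information retrieval, image processing, and biology. To improve classification performance on imbalanced data, this paper proposes OBPD-EFSBoost, an ensemble learning algorithm based on balanced sampling and feature selection. The algorithm consists of three main steps. First, it oversamples according to a probability model obtained from a Gaussian mixture distribution of the minority class, constructing a balanced dataset and enlarging the potential decision region of the minority class. Second, in each round of training an individual classifier, it weights both samples and features according to the samples misclassified in the previous round, filtering out redundant and noisy features. Finally, the ensemble classifier is obtained by weighted voting over the individual classifiers. Classification results on eight UCI datasets show that the algorithm not only effectively improves classification accuracy on the minority class but also remedies the sensitivity of Boosting-type algorithms to noisy features, and is therefore strongly robust.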
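The first and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the minority-class Gaussian mixture has already been fitted (for example by EM), that covariances are diagonal, and that class labels are +1/-1. The function names (`sample_from_mixture`, `oversample_to_balance`, `weighted_vote`) and all numeric values are hypothetical. The second step (joint sample and feature weighting) is omitted because the abstract does not give its update rule.

```python
import random

def sample_from_mixture(weights, means, stds, n, rng):
    """Draw n points from a Gaussian mixture with diagonal covariance."""
    samples = []
    for _ in range(n):
        # Pick a component in proportion to its mixing weight.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Draw each feature independently (diagonal-covariance assumption).
        samples.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return samples

def oversample_to_balance(majority, minority, mixture, rng):
    """Append synthetic minority points until both classes have equal size."""
    deficit = max(0, len(majority) - len(minority))
    weights, means, stds = mixture
    return minority + sample_from_mixture(weights, means, stds, deficit, rng)

def weighted_vote(predictions, alphas):
    """Combine +1/-1 predictions of individual classifiers by weighted vote."""
    score = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if score >= 0 else -1

# Hypothetical toy data: 20 majority points and 4 minority points in 2-D.
rng = random.Random(0)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
minority = [[3.0, 3.1], [2.8, 3.3], [5.1, 0.9], [4.9, 1.2]]

# Assumed, already-fitted mixture: (mixing weights, means, standard deviations).
mixture = ([0.5, 0.5], [[2.9, 3.2], [5.0, 1.0]], [[0.2, 0.2], [0.2, 0.2]])

balanced_minority = oversample_to_balance(majority, minority, mixture, rng)
```

Sampling from a fitted density, rather than interpolating between existing points, lets synthetic samples fall anywhere the model assigns probability mass, which is one way to read the abstract's claim about enlarging the minority class's potential decision region.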