Computer Science (计算机科学), 2019, Vol. 46, Issue 6A: 439-441.
钟熙, 孙祥娥
ZHONG Xi, SUN Xiang-e
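Abstract: Naive Bayes is simple, computationally efficient, and accurate, rests on a solid theoretical foundation, and is widely used. Because diversity is a key condition for ensemble learning, this paper proposes a method that uses K-means++ clustering to increase the diversity of a naive Bayes classifier ensemble, thereby improving the generalization performance of naive Bayes. First, multiple naive Bayes base classifiers are trained on the training set. Then, to increase the diversity among the base classifiers, the K-means++ algorithm clusters the base classifiers according to their predictions on a validation set. Finally, the base classifier with the best generalization performance is selected from each cluster for the ensemble, and the final result is obtained by simple voting. The method is validated on standard UCI datasets, and the results show a considerable improvement in generalization performance.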
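The selection and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the base classifiers are already trained and that their validation-set predictions are available as lists of numeric class labels, which are used directly as coordinates for clustering (the abstract does not specify the encoding). The function names `kmeans_pp`, `select_per_cluster`, and `majority_vote` are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp(points, k, rng, n_iter=20):
    """Cluster points with k-means++ seeding followed by Lloyd iterations."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Sample the next seed with probability proportional to the squared
        # distance from the nearest seed chosen so far.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point coincides with an existing seed
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights)[0]))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def accuracy(pred, y_true):
    """Fraction of validation samples predicted correctly."""
    return sum(p == t for p, t in zip(pred, y_true)) / len(y_true)


def select_per_cluster(val_preds, y_val, k, seed=0):
    """Return the index of the most accurate base classifier in each cluster.

    val_preds[i] holds the validation-set predictions of base classifier i.
    """
    labels = kmeans_pp(val_preds, k, random.Random(seed))
    selected = []
    for j in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == j]
        selected.append(max(members, key=lambda i: accuracy(val_preds[i], y_val)))
    return sorted(selected)


def majority_vote(preds):
    """Combine per-classifier predictions sample by sample with a plain vote."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*preds)]
```

In use, `select_per_cluster` would be called once on the validation predictions, and `majority_vote` would then be applied to the test-set predictions of the selected classifiers only. Choosing an odd number of clusters avoids tied votes in the two-class case.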