Computer Science, 2019, Vol. 46, Issue (6A): 439-441.

• Big Data & Data Mining •

Research on Naive Bayes Ensemble Method Based on Kmeans++ Clustering

ZHONG Xi, SUN Xiang-e

  1. National Electrical and Electronic Demonstration Center for Experimental Education, Yangtze University, Jingzhou, Hubei 434000, China
  • Online: 2019-06-14 Published: 2019-07-02
  • Corresponding author: SUN Xiang-e (born 1970), female, professor, doctoral supervisor. Her main research interests are signal processing methods and their implementation. E-mail: xinges2000@yangzteu.edu.cn
  • About the author: ZHONG Xi (born 1992), male, master's student. His main research interests are signal detection and processing.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (51604038).

Abstract: Naive Bayes is widely used because it is simple, computationally efficient, accurate, and has a solid theoretical foundation. Because diversity among base classifiers is a key condition for effective ensemble learning, this paper proposes a method that uses Kmeans++ clustering to increase the diversity of a naive Bayes ensemble and thereby improve the generalization performance of naive Bayes. First, multiple naive Bayes base classifiers are trained on the training set. Then, to increase the diversity among the base classifiers, the Kmeans++ algorithm clusters them according to their predictions on the validation set. Finally, the base classifier with the best generalization performance in each cluster is selected for the ensemble, and the final result is obtained by simple voting. Experiments on UCI standard data sets show that the method considerably improves generalization performance.

Key words: Diversity, Ensemble learning, Kmeans++ clustering, Naive Bayes
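
The three steps described in the abstract (train several naive Bayes base classifiers, cluster them with Kmeans++ by their validation-set predictions, keep the most accurate classifier in each cluster and combine by voting) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The abstract does not say how the base classifiers are generated, so bootstrap resampling is assumed here. Gaussian naive Bayes, synthetic two-class data in place of UCI data, the values `n_base=15` and `k=5`, and all function names are likewise assumptions made for the example.

```python
import math
import random
from collections import Counter

def fit_gnb(X, y):
    """Fit Gaussian naive Bayes: per-class prior, feature means and variances."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-6
                 for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(y), means, varis)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log-posterior for sample x."""
    def log_post(c):
        prior, means, varis = model[c]
        return math.log(prior) - 0.5 * sum(
            math.log(2 * math.pi * s) + (v - m) ** 2 / s
            for v, m, s in zip(x, means, varis))
    return max(model, key=log_post)

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_pp(points, k, rng, iters=20):
    """k-means++ seeding followed by Lloyd iterations; returns cluster labels."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
        else:             # sample proportionally to squared distance
            centers.append(list(rng.choices(points, weights=d2)[0]))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_select_ensemble(X_tr, y_tr, X_val, y_val, n_base=15, k=5, seed=0):
    """Train base classifiers, cluster them, keep the best one per cluster."""
    rng = random.Random(seed)
    n = len(X_tr)
    models = []
    for _ in range(n_base):  # assumption: diversity comes from bootstrap samples
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_gnb([X_tr[i] for i in idx], [y_tr[i] for i in idx]))
    # Each classifier is represented by its prediction vector on the validation
    # set (numeric 0/1 labels here; multi-class labels would need an encoding).
    preds = [[predict_gnb(m, x) for x in X_val] for m in models]
    accs = [sum(p == t for p, t in zip(pv, y_val)) / len(y_val) for pv in preds]
    labels = kmeans_pp(preds, k, rng)
    chosen = []
    for j in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == j]
        chosen.append(models[max(members, key=lambda i: accs[i])])
    return chosen

def vote(models, x):
    """Simple majority vote over the selected classifiers."""
    return Counter(predict_gnb(m, x) for m in models).most_common(1)[0][0]

def make_data(n, rng):
    """Synthetic two-class Gaussian data (stand-in for a UCI data set)."""
    X, y = [], []
    for _ in range(n):
        c = rng.randrange(2)
        X.append([rng.gauss(2.0 * c, 1.0), rng.gauss(-1.5 * c, 1.0)])
        y.append(c)
    return X, y

rng = random.Random(42)
X_tr, y_tr = make_data(200, rng)
X_val, y_val = make_data(100, rng)
X_te, y_te = make_data(100, rng)
ensemble = cluster_select_ensemble(X_tr, y_tr, X_val, y_val)
acc = sum(vote(ensemble, x) == t for x, t in zip(X_te, y_te)) / len(y_te)
print(f"{len(ensemble)} classifiers selected, test accuracy = {acc:.2f}")
```

Clustering on prediction vectors groups classifiers that make similar decisions, so taking one classifier per cluster is a direct way to favor classifiers that disagree with each other. On easy data such as this synthetic set the base classifiers may behave almost identically, in which case fewer than `k` clusters are non-empty and the ensemble is correspondingly smaller.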

CLC number: TP391