Computer Science ›› 2018, Vol. 45 ›› Issue (9): 65-69. doi: 10.11896/j.issn.1002-137X.2018.09.009

• The 16th National Conference on Software and Applications •

Naive Bayesian Decision Tree Algorithm Combining SMOTE and Filter-Wrapper and Its Application

XU Zhao-zhao1, LI Ching-hwa1, CHEN Tong-lin1, LEE Shin-jye1,2

  1. School of Software, Yunnan University, Kunming 650091, China
  2. Key Laboratory in Software Engineering of Yunnan Province, Kunming 650091, China
  • Received: 2017-10-09 Online: 2018-09-20 Published: 2018-10-10
  • Corresponding author: LEE Shin-jye (born 1974), male, Ph.D., professor. His main research interests include machine learning, computational intelligence, and decision support systems. E-mail: camhero@hotmail.com
  • About the authors: XU Zhao-zhao (born 1991), male, master's student. His main research interests include data mining and machine learning. E-mail: 243642549@qq.com; LI Ching-hwa (born 1994), female, master's student. Her main research interest is fuzzy logic systems. E-mail: chinghwali@hotmail.com; CHEN Tong-lin (born 1992), male, master's student. His main research interests include data mining and machine learning. E-mail: tonglinchen@hotmail.com
  • Supported by:
    This work was supported by the National Natural Science Foundation of China project "Dual-model-driven modeling and analysis for dynamic software evolution in cloud computing environments" (61379032).

Abstract: Mining the medical data generated by Internet-of-Things smart healthcare systems in the context of "Industry 4.0" efficiently and accurately remains a serious problem. Medical data are often high-dimensional, imbalanced, and noisy. This paper therefore proposes a new data processing method that combines the SMOTE method with a Filter-Wrapper feature selection algorithm and applies it to support clinical decision-making. In particular, the proposed method not only overcomes the poor predictions that Naive Bayes produces in practice because of its attribute-independence assumption, but also avoids the over-fitting that arises when a C4.5 decision tree model is built. Applied to ECG clinical decision-making, the proposed algorithm achieves good results.

Key words: Data balance, Decision tree, Naive Bayesian, Wrapper feature selection
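The abstract names SMOTE as the data-balancing step but gives no details on this page. As a rough illustration only, the sketch below shows the core idea of SMOTE as described by Chawla et al.: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbours. It is a simplified variant that picks base samples at random, not the authors' implementation; the function names `smote` and `balance`, the parameters `k` and `seed`, and the toy data are all assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base point among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample minority_label until it is as large as the rest of the data."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(0, (len(X) - len(minority)) - len(minority))
    new_points = smote(minority, n_new, k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Hypothetical toy data: four majority samples (label 0), two minority (label 1).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X, y, minority_label=1, k=1)
print(y_bal.count(0), y_bal.count(1))
```

In a pipeline like the one the abstract outlines, a balancing step of this kind would presumably run before feature selection and classifier training, and only on the training split so that synthetic points do not leak into evaluation data.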

CLC Number:

  • TP391
Related articles:
[1] 任首朋, 李劲, 王静茹, 岳昆.
Ensemble Regression Decision Trees-based lncRNA-disease Association Prediction
Computer Science, 2022, 49(2): 265-271. https://doi.org/10.11896/jsjkx.201100132
[2] 刘振宇, 宋晓莹.
Multivariate Regression Forest for Categorical Attribute Data
Computer Science, 2022, 49(1): 108-114. https://doi.org/10.11896/jsjkx.201200189
[3] 曹扬晨, 朱国胜, 祁小云, 邹洁.
Research on Intrusion Detection Classification Based on Random Forest
Computer Science, 2021, 48(6A): 459-463. https://doi.org/10.11896/jsjkx.200600161
[4] 唐亮, 李飞.
Research on Forecasting Model of Internet of Vehicles Security Situation Based on Decision Tree
Computer Science, 2021, 48(6A): 514-517. https://doi.org/10.11896/jsjkx.200700158
[5] 韩丽霞, 张占营.
TAN-based Service Pricing Strategy
Computer Science, 2021, 48(6A): 203-. https://doi.org/10.11896/jsjkx.200900024
[6] 雷剑梅, 曾令秋, 牟洁, 陈立东, 王淙, 柴勇.
Reverse Diagnostic Method Based on Vehicle EMC Standard Test and Machine Learning
Computer Science, 2021, 48(6): 190-195. https://doi.org/10.11896/jsjkx.200700204
[7] 丁思凡, 王锋, 魏巍.
Relief Feature Selection Algorithm Based on Label Correlation
Computer Science, 2021, 48(4): 91-96. https://doi.org/10.11896/jsjkx.200800025
[8] 董明刚, 黄宇扬, 敬超.
K-Nearest Neighbor Classification Training Set Optimization Method Based on Genetic Instance and Feature Selection
Computer Science, 2020, 47(8): 178-184. https://doi.org/10.11896/jsjkx.190700089
[9] 朱涤尘, 夏换, 杨秀璋, 于小民, 张亚成, 武帅.
Research on Mobile Game Industry Development in China Based on Text Mining and Decision Tree Analysis
Computer Science, 2020, 47(6A): 530-534. https://doi.org/10.11896/JsJkx.190700124
[10] 邹洁, 朱国胜, 祁小云, 曹扬晨.
HTTPS Encrypted Traffic Classification Method Based on C4.5 Decision Tree
Computer Science, 2020, 47(6A): 381-385. https://doi.org/10.11896/JsJkx.191200155
[11] 余孟池, 牟甲鹏, 蔡剑, 徐建.
Noisy Label Classification Learning Based on Relabeling Method
Computer Science, 2020, 47(6): 79-84. https://doi.org/10.11896/jsjkx.190600041
[12] 董本清, 李凤坤.
Analysis of Emotional Degree of Poetry Reading Based on WDOUDT
Computer Science, 2020, 47(11A): 46-51. https://doi.org/10.11896/jsjkx.200600055
[13] 钟熙, 孙祥娥.
Research on Naive Bayes Ensemble Method Based on Kmeans++ Clustering
Computer Science, 2019, 46(6A): 439-441.
[14] 吕明琪, 李一帆, 陈铁明.
Spatial Estimation Method of Air Quality Based on Terrain Factors
Computer Science, 2019, 46(1): 265-270. https://doi.org/10.11896/j.issn.1002-137X.2019.01.041
[15] 南世慧, 魏伟, 吴华清, 邹金蓉, 赵志文.
Web Server Fingerprint Identification Technology Based on KNN and GBDT
Computer Science, 2018, 45(8): 141-145. https://doi.org/10.11896/j.issn.1002-137X.2018.08.025