Computer Science ›› 2025, Vol. 52 ›› Issue (3): 137-151. doi: 10.11896/jsjkx.240600045
OU Guiliang1, HE Yulin1,2, ZHANG Manjing1, HUANG Zhexue1,2, Philippe FOURNIER-VIGER2
Abstract: The naive Bayesian classifier is regarded as one of the top ten classical algorithms in machine learning. Known for its complete theoretical foundation and simple model structure, it has achieved good classification results in many practical applications. However, the conditional attribute independence assumption limits the performance of the naive Bayesian classifier to some extent, and a large body of work has been proposed to alleviate this problem, the weighted naive Bayesian classifier being one such line of research. Based on an in-depth analysis of the role of marginal probability weights, this paper proposes a Risk Minimization-Based Weighted Naive Bayesian Classifier (RM-WNBC), which considers both the empirical risk of the classifier and the structural risk of the weights during weight determination. Unlike existing improvement strategies that focus excessively on the external generalization performance of the naive Bayesian classifier, RM-WNBC improves generalization performance from the perspective of the classifier's internal probability distributions. The empirical risk measures the classification ability of the weighted naive Bayesian classifier and is expressed by the estimation quality of the posterior probabilities; the structural risk characterizes how the weighted naive Bayesian classifier handles attribute dependence and is expressed by the mean squared error of the class-conditional probabilities. Empirical risk minimization guarantees that RM-WNBC attains good training accuracy, while structural risk minimization enables RM-WNBC to achieve the best capability of representing attribute dependence. To obtain the optimal weights of RM-WNBC, an efficient and convergent weight-update strategy is derived to guarantee the minimization of both the structural and the empirical risk. The feasibility, rationality, and effectiveness of RM-WNBC are validated on 31 benchmark classification data sets from UCI and KEEL. The experimental results show that: 1) the training and testing accuracies of RM-WNBC increase gradually until convergence as the marginal probability weights are updated; 2) RM-WNBC has a better capability of representing attribute dependence than existing weighted naive Bayesian classifiers; and 3) at the given significance level, RM-WNBC achieves better training and testing performance on the 31 data sets than the classical naive Bayesian classifier, three Bayesian networks, four weighted naive Bayesian classifiers, and one feature-selection-based naive Bayesian classifier.
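Since the abstract only describes RM-WNBC at a high level, the following minimal Python sketch illustrates the general idea of a marginal-probability-weighted naive Bayes whose attribute weights are fitted by minimizing a combined empirical/structural objective. The concrete risk definitions, the derived update rule, and all identifiers (WeightedNB, lam, lr, _risk, etc.) are illustrative assumptions, not the paper's actual formulation: the empirical-risk term below (negative log-posterior of the true class) and the structural-risk term (variance of the weighted per-attribute log class-conditional contributions) are stand-ins, and the numeric gradient replaces the paper's analytic, convergence-guaranteed update.

```python
import numpy as np

class WeightedNB:
    """Hypothetical marginal-probability-weighted naive Bayes (sketch only)."""

    def __init__(self, lam=0.1, lr=0.05, n_iter=200):
        # lam trades the stand-in structural risk off against the empirical risk
        self.lam, self.lr, self.n_iter = lam, lr, n_iter

    def fit(self, X, y):
        # X: (n, d) integer-coded discrete attributes; y: (n,) classes in {0..C-1}
        n, d = X.shape
        self.n_classes_ = int(y.max()) + 1
        counts = np.array([(y == c).sum() for c in range(self.n_classes_)])
        self.log_prior_ = np.log((counts + 1) / (n + self.n_classes_))  # Laplace
        self.log_cond_ = []  # per attribute: (C, V_j) table of log P(x_j | c)
        for j in range(d):
            tab = np.ones((self.n_classes_, int(X[:, j].max()) + 1))  # Laplace
            np.add.at(tab, (y, X[:, j]), 1)
            self.log_cond_.append(np.log(tab / tab.sum(axis=1, keepdims=True)))
        self.w_ = np.ones(d)  # marginal probability weights, uniform start
        for _ in range(self.n_iter):  # numeric gradient descent on the joint risk
            self.w_ = np.clip(self.w_ - self.lr * self._num_grad(X, y), 0.0, None)
        return self

    def _joint_log(self, X, w):
        # log P(c) + sum_j w_j * log P(x_j | c), for every sample and class
        out = np.tile(self.log_prior_, (len(X), 1))
        for j in range(X.shape[1]):
            out += w[j] * self.log_cond_[j][:, X[:, j]].T
        return out

    def _risk(self, X, y, w):
        jl = self._joint_log(X, w)
        post = jl - np.logaddexp.reduce(jl, axis=1, keepdims=True)
        emp = -post[np.arange(len(y)), y].mean()  # stand-in empirical risk
        # Stand-in structural risk: spread of the weighted per-attribute
        # contributions for the true class (penalises very uneven weighting).
        contrib = np.stack([w[j] * self.log_cond_[j][y, X[:, j]]
                            for j in range(X.shape[1])], axis=1)
        return emp + self.lam * contrib.var(axis=1).mean()

    def _num_grad(self, X, y, eps=1e-4):
        # Central-difference gradient; the paper derives an analytic update instead.
        g = np.zeros_like(self.w_)
        for j in range(len(self.w_)):
            wp, wm = self.w_.copy(), self.w_.copy()
            wp[j] += eps
            wm[j] -= eps
            g[j] = (self._risk(X, y, wp) - self._risk(X, y, wm)) / (2 * eps)
        return g

    def predict(self, X):
        return np.argmax(self._joint_log(X, self.w_), axis=1)


# Illustrative usage on synthetic discrete data
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5))
y = (X[:, 0] + X[:, 1] > 2).astype(int)
clf = WeightedNB(n_iter=50).fit(X, y)
print("training accuracy:", (clf.predict(X) == y).mean())
```

Note the design choice mirrored from the abstract: the weights enter as exponents on the class-conditional probabilities (hence as multipliers of their logarithms), so updating them reshapes the classifier's internal probability model rather than post-processing its outputs.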