Computer Science ›› 2014, Vol. 41 ›› Issue (Z6): 339-342.

• Information Security •

Improved Bayesian Text Classification Algorithm in Cloud Computing Environment

ZHANG Lin and SHAO Tian-hao

  1. College of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Online: 2018-11-14  Published: 2018-11-14
  • Supported by:
    This work was supported by the Natural Science Foundation for Provincial Universities (13KJB520017) and the Scientific Research Foundation of Nanjing University of Posts and Telecommunications (NY213155).


Abstract: Based on the idea of cloud computing, the MapReduce model is used to overcome the inability of the traditional Bayesian classification algorithm to handle large-scale data, which greatly improves classification speed. The algorithm is further improved in light of its parallel structure: synonym merging and word-frequency filtering are introduced to reduce the dimensionality of the feature vectors and the number of misclassifications, and selected key terms are then given extra weight to further improve classification accuracy. Experiments on a Hadoop cloud computing platform show that the parallelized traditional text classification algorithm achieves a good speedup on Hadoop, and that the improved algorithm raises classification precision.
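In symbols, the classifier the abstract sketches is a multinomial naive Bayes rule with an extra weight on selected key terms; the exact weighting form below is an assumed reading of the abstract, not a formula quoted from the paper:

    c^{*} = \arg\max_{c} \Big( \log P(c) + \sum_{t \in d} w_{t}\, n_{t}(d) \log P(t \mid c) \Big)

where n_t(d) is the frequency of term t in document d after synonym merging and word-frequency filtering, P(t|c) is estimated with Laplace smoothing, and w_t > 1 only for the specially weighted key terms (w_t = 1 otherwise).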

Key words: Cloud computing, Text classification, Parallelization, Hadoop
  CLC Number: TP391.1    Document Code: A

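As a rough, single-machine sketch of the training and classification flow the abstract describes, the Python code below imitates the Map phase (emitting per-class term counts after synonym merging) and the Reduce phase (aggregating counts and discarding rare terms), then classifies with the weighted naive Bayes rule given above. The synonym table, the frequency threshold, the set of key terms, the weight value, and all function names are illustrative assumptions rather than the authors' implementation; on a real cluster the two phases would run as Hadoop MapReduce jobs instead of local Python functions.

    # Minimal single-machine sketch of the parallelized naive Bayes training
    # described in the abstract. All tables and thresholds below are assumed
    # for illustration, not taken from the paper.
    from collections import defaultdict
    import math

    SYNONYMS = {"automobile": "car"}   # assumed synonym-merging table
    MIN_TERM_FREQ = 2                  # assumed word-frequency filter threshold
    KEY_TERMS = {"encryption"}         # assumed specially weighted key terms
    KEY_WEIGHT = 2.0                   # assumed extra weight for key terms

    def map_phase(docs):
        # Mapper: emit ((label, term), 1) pairs after mapping terms to synonyms.
        for label, text in docs:
            for term in text.lower().split():
                term = SYNONYMS.get(term, term)
                yield (label, term), 1

    def reduce_phase(pairs):
        # Reducer: sum counts per (label, term) key, then drop rare terms.
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return {k: v for k, v in counts.items() if v >= MIN_TERM_FREQ}

    def train(docs):
        counts = reduce_phase(map_phase(docs))
        doc_counts = defaultdict(int)
        for label, _ in docs:
            doc_counts[label] += 1
        class_totals = defaultdict(int)
        vocab = set()
        for (label, term), c in counts.items():
            class_totals[label] += c
            vocab.add(term)
        return counts, doc_counts, class_totals, vocab

    def classify(text, counts, doc_counts, class_totals, vocab):
        # Multinomial naive Bayes with Laplace smoothing and key-term weighting.
        total_docs = sum(doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in doc_counts:
            score = math.log(doc_counts[label] / total_docs)
            for term in text.lower().split():
                term = SYNONYMS.get(term, term)
                tf = counts.get((label, term), 0)
                p = (tf + 1) / (class_totals[label] + len(vocab))
                weight = KEY_WEIGHT if term in KEY_TERMS else 1.0
                score += weight * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    if __name__ == "__main__":
        docs = [("tech", "car engine encryption encryption automobile"),
                ("sport", "match goal goal team team")]
        model = train(docs)
        print(classify("car encryption", *model))   # -> tech

Weighting a key term's log-probability is equivalent to counting that term several times, which is one common way to emphasize discriminative terms in naive Bayes; whether the paper uses this exact scheme is not stated in the abstract.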

