Computer Science ›› 2016, Vol. 43 ›› Issue (4): 256-259, 269. doi: 10.11896/j.issn.1002-137X.2016.04.052


Bayesian Chinese Spam Filtering Method Based on Phrases

WANG Qing-song and WEI Ru-yu   

Online: 2018-12-01    Published: 2018-12-01

Abstract: Naive Bayes has been widely used in spam filtering, where feature extraction is one of the essential steps of the algorithm. Previous Chinese spam filtering methods extracted only words as text features. With large-scale email training samples, the time efficiency of this approach becomes a bottleneck for spam filtering. This paper proposes a phrase-based Bayesian spam filtering algorithm that incorporates a new phrase analysis method from the text classification field. Phrases are extracted as the feature unit according to the rules of basic noun phrases, verb phrases, and semantic analysis. Comparison experiments between word-based and phrase-based spam filtering confirm the effectiveness of the proposed method.

Key words: Spam filtering, Bayesian, Feature extraction, Phrase-based, Chinese word segmentation
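
To make the idea in the abstract concrete, the following is a minimal sketch of a multinomial Naive Bayes filter that works on pre-extracted features, which may be words or phrases. It is not the authors' implementation: the abstract does not give the extraction rules or the exact probability model, so Laplace (add-one) smoothing, the function names `train`, `log_score`, and `classify`, and the tiny sample phrases are all assumptions made for illustration. Phrase extraction itself is assumed to have happened upstream, and both classes are assumed to appear in the training data.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")

def train(samples):
    """Count documents and features per class.

    samples: list of (features, label) pairs, where features is a list of
    already-extracted units (words or phrases) and label is "spam" or "ham".
    """
    doc_counts = Counter()
    feat_counts = {label: Counter() for label in LABELS}
    for features, label in samples:
        doc_counts[label] += 1
        feat_counts[label].update(features)
    vocab = set(feat_counts["spam"]) | set(feat_counts["ham"])
    return doc_counts, feat_counts, vocab

def log_score(features, label, model):
    """Log prior plus sum of log likelihoods, with add-one smoothing (an assumption)."""
    doc_counts, feat_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_feats = sum(feat_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for f in features:
        if f not in vocab:
            continue  # features never seen in training carry no evidence
        score += math.log((feat_counts[label][f] + 1) / (total_feats + len(vocab)))
    return score

def classify(features, model):
    """Return the label with the higher log score."""
    return max(LABELS, key=lambda label: log_score(features, label, model))

# Hypothetical phrase-level training data, invented for this sketch.
training = [
    (["免费领取", "点击链接"], "spam"),
    (["免费领取", "限时优惠"], "spam"),
    (["项目会议", "会议纪要"], "ham"),
    (["项目会议", "请查收附件"], "ham"),
]
model = train(training)
print(classify(["免费领取", "限时优惠"], model))
print(classify(["项目会议"], model))
```

Because the classifier only sees a list of feature strings, switching between word-level and phrase-level filtering amounts to changing the extraction step that produces those lists. A smaller feature set per message also means fewer counts to store and fewer terms to sum at classification time, which is one plausible reading of the time-efficiency concern raised in the abstract.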

