Computer Science ›› 2014, Vol. 41 ›› Issue (6): 214-216. doi: 10.11896/j.issn.1002-137X.2014.06.042

• Artificial Intelligence •

Improvement of Information Gain in Spam Filtering

ZHAI Jun-chang, QIN Yu-ping and CHE Wei-wei

  1. Bohai University, Jinzhou 121000; Bohai University, Jinzhou 121000; Shenyang University, Shenyang 110044
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61104106)

Abstract: To address feature selection in spam filtering, this paper proposes an improved information gain method. First, a gain ratio is defined from the prior probability of each feature word. The gain ratio is then used to amplify or weaken the amount of information that the feature word contributes to the overall classification, which improves the calculation of the feature word's class-conditional entropy. Experiments were carried out on an English corpus using a naive Bayes decision method based on the maximum a posteriori hypothesis, and the algorithm was evaluated in terms of recall, accuracy, precision and error rate. The experimental results show that the improved algorithm raises the classification precision of the filter and reduces the loss to users caused by the filter misclassifying legitimate mail.

Key words: Information gain, Feature selection, Spam, Naive Bayes
CLC number: TP391 Document code: A
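
For orientation, the sketch below computes the standard (unmodified) information gain of a binary feature word for a two-class spam/legitimate split, i.e. the class entropy minus the class-conditional entropy given the word's presence or absence. The abstract does not state the gain-ratio formula, so the paper's improvement is not reproduced here. The function names, the argument layout and the message counts are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(n_spam_with, n_legit_with, n_spam, n_legit):
    """Standard information gain of a binary feature word.

    n_spam_with / n_legit_with: messages of each class that contain the word.
    n_spam / n_legit: total messages of each class.
    Note: this is the textbook definition, not the paper's improved variant.
    """
    n = n_spam + n_legit
    h_class = entropy([n_spam / n, n_legit / n])

    n_with = n_spam_with + n_legit_with
    n_without = n - n_with

    # Class-conditional entropy, weighted by how often the word is present/absent.
    h_cond = 0.0
    if n_with:
        h_cond += (n_with / n) * entropy(
            [n_spam_with / n_with, n_legit_with / n_with]
        )
    if n_without:
        h_cond += (n_without / n) * entropy(
            [(n_spam - n_spam_with) / n_without,
             (n_legit - n_legit_with) / n_without]
        )
    return h_class - h_cond


# Hypothetical counts: (spam messages with word, legitimate messages with word),
# out of 10 spam and 10 legitimate messages.
counts = {"free": (10, 0), "meeting": (1, 8), "the": (5, 5)}
ranked = sorted(
    counts,
    key=lambda w: information_gain(*counts[w], 10, 10),
    reverse=True,
)
```

Feature selection then keeps the highest-ranked words. A word that occurs equally often in both classes scores zero and would be discarded first.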

