Computer Science ›› 2017, Vol. 44 ›› Issue (Z6): 61-67. doi: 10.11896/j.issn.1002-137X.2017.6A.012


Spam Filter Algorithm with Improved Porter Stemmer and Kernels Methods

SUN Han-bo and FENG Guo-can   

  • Online: 2017-12-01 Published: 2018-12-01

Abstract: Statistical learning methods are now widely used for spam classification, with Bayesian classifiers and SVMs among the most popular. To meet the challenge of spam, a number of novel ideas and improved algorithms have been proposed. We propose an improved Porter Stemmer algorithm that extracts text features more thoroughly and is tailored to spam classifiers. Compared with the original algorithm, the linear-kernel SVM, Gaussian-kernel SVM, polynomial-kernel SVM and Naive Bayes classifiers achieve error-rate decreases of 63.7%, 63.1%, 61.3% and 11.4%, respectively, when using the proposed improved Porter Stemmer. The experimental results also show that SVM has significant advantages over Naive Bayes for spam classification, and that the SVMs benefit more from the improved Porter Stemmer. We also give a brief linguistic analysis and illustrate the potential value of personalized, customized spam classifiers.

Key words: Spam, SVM, Kernel function, SMO algorithm, Porter Stemmer
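
The abstract names three SVM kernels (linear, Gaussian and polynomial). As a rough illustration of what these kernel functions compute, here is a minimal sketch in plain Python. The parameter values (`degree`, `coef0`, `gamma`) and the toy word-count vectors are assumptions chosen for the example; the abstract does not state the settings used in the paper.

```python
import math


def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree; degree and coef0 are assumed values."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical word-count vectors for two emails over a 3-stem vocabulary.
email_a = [1, 0, 2]
email_b = [0, 1, 2]

print(linear_kernel(email_a, email_b))      # 4
print(polynomial_kernel(email_a, email_b))  # 25.0
print(gaussian_kernel(email_a, email_b))    # exp(-1), about 0.368
```

In a setup like this, the stemmer determines which word forms collapse into the same vector component, so a change in stemming changes the inputs that every kernel sees.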

