Computer Science, 2015, Vol. 42, Issue (1): 239-243. doi: 10.11896/j.issn.1002-137X.2015.01.053


Web Spam Detection Based on Integrated Classifier with Bagging-SVM

TANG Shou-hong, ZHU Yan and YANG Fan   

  • Online:2018-11-14 Published:2018-11-14

Abstract: Web spam not only degrades the quality of information retrieval but also threatens Internet security. This paper proposes a Bagging-based ensemble of SVMs to detect Web spam. In the preprocessing stage, a K-means-based technique is first applied to address the class imbalance of the dataset; an optimal feature subset is then selected using CFS, and finally the selected features are discretized using information entropy. In the classifier training stage, Bagging produces several training sets, and an SVM is trained on each of them to yield a weak classifier. In the detection stage, the weak classifiers vote on each test sample to determine its category. Experimental results on WEBSPAM-UK2006 show that the proposed method achieves better results with fewer features.

Key words: Web spam, Integrated classifier, Feature selection, Information entropy, Weak classifier
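
The training and detection stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear SVM trained by batch subgradient descent on the hinge loss as a stand-in for the paper's SVM learner, and it skips the preprocessing stage (K-means-based balancing, CFS, entropy discretization). The function names, hyperparameters, and two-feature toy data are all hypothetical.

```python
import random
from collections import Counter


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Linear SVM (labels +1/-1) fitted by batch subgradient descent on the
    L2-regularised hinge loss. Assumed stand-in for the paper's SVM learner."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_one(model, x):
    """Classify one sample with a single weak classifier."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def bagging_svm(X, y, n_estimators=11, seed=0):
    """Train one SVM per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_linear_svm([X[i] for i in idx],
                                       [y[i] for i in idx]))
    return models


def vote(models, x):
    """Majority vote of the weak classifiers; an odd count avoids ties."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: +1 = spam, -1 = normal, two made-up features.
X_pos = [(2.0, 2.5), (2.5, 1.5), (3.0, 2.0), (1.5, 2.0), (2.0, 3.0),
         (3.0, 3.0), (2.5, 2.5), (1.5, 1.5), (2.0, 1.0), (1.0, 2.0)]
X_neg = [(-a, -c) for a, c in X_pos]
X = X_pos + X_neg
y = [1] * len(X_pos) + [-1] * len(X_neg)

models = bagging_svm(X, y)
print(vote(models, (3.0, 3.0)), vote(models, (-3.0, -3.0)))
```

Each bootstrap sample yields a slightly different decision boundary, and the vote smooths out the errors of individual weak classifiers. In practice a library SVM with a tuned kernel would likely replace `train_linear_svm`.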

