Computer Science, 2015, Vol. 42, Issue (1): 239-243. doi: 10.11896/j.issn.1002-137X.2015.01.053

Web Spam Detection Based on Integrated Classifier with Bagging-SVM

TANG Shou-hong, ZHU Yan and YANG Fan   

  • Online:2018-11-14 Published:2018-11-14

Abstract: Web spam not only degrades the quality of information retrieval but also threatens Internet security. This paper proposes a Bagging-based ensemble of SVM classifiers to detect Web spam. In the preprocessing stage, a K-means-based technique is first applied to address the class imbalance of the dataset, and an optimal feature subset is then selected using CFS. Finally, the selected feature subset is discretized based on information entropy. In the classifier training stage, several training sets are generated by Bagging, and an SVM is trained on each of them to produce a weak classifier. In the detection stage, the weak classifiers vote on each test sample to determine its category. Experimental results on the WEBSPAM-UK2006 dataset show that the proposed method achieves better results with fewer features.

Key words: Web spam, Integrated classifier, Feature selection, Information entropy, Weak classifier
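
The training and detection stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the preprocessing steps (K-means-based rebalancing, CFS, entropy-based discretization) are omitted, the weak learner is a simple linear SVM trained by subgradient descent on the hinge loss, and the function names, hyperparameters (`epochs`, `eta`, `lam`, `n_estimators`) and toy data are all hypothetical.

```python
import random
from collections import Counter


def train_linear_svm(X, y, epochs=50, eta=0.01, lam=0.01, seed=0):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    Labels are +1 (spam) or -1 (normal). Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            violated = y[i] * score < 1.0  # sample lies inside the margin
            for j in range(len(w)):
                grad = lam * w[j] - (y[i] * X[i][j] if violated else 0.0)
                w[j] -= eta * grad
            if violated:
                b += eta * y[i]
    return w, b


def predict_one(model, x):
    """Classify one sample with a single weak classifier."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1


def bagging_svm(X, y, n_estimators=11, seed=0):
    """Train one weak SVM per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for k in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        models.append(train_linear_svm(Xb, yb, seed=seed + k))
    return models


def vote(models, x):
    """Majority vote of the weak classifiers; an odd ensemble size avoids ties."""
    counts = Counter(predict_one(m, x) for m in models)
    return counts.most_common(1)[0][0]


def make_toy_data(n_per_class=20, seed=1):
    """Two well-separated 2-D clusters standing in for spam/normal feature vectors."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((1, 3.0), (-1, -3.0)):
        for _ in range(n_per_class):
            X.append([center + rng.uniform(-0.5, 0.5),
                      center + rng.uniform(-0.5, 0.5)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    ensemble = bagging_svm(X, y)
    print(vote(ensemble, [3.0, 3.0]), vote(ensemble, [-3.0, -3.0]))
```

Each weak classifier sees a different bootstrap sample, so the ensemble's vote is less sensitive to any single training set than one SVM would be. In practice a kernel SVM from an established library would likely replace the hand-written learner used here.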

