Computer Science ›› 2015, Vol. 42 ›› Issue (2): 118-122. doi: 10.11896/j.issn.1002-137X.2015.02.025


Research on Focused Crawling Technology Based on SVM

LI Lu, ZHANG Guo-yin and LI Zheng-wen   

  • Online: 2018-11-14 Published: 2018-11-14

Abstract: With the rapid development of the Internet, network information has become massive and diverse. Providing the information users need quickly and accurately is the primary task of a search engine. Traditional general-purpose search engines can serve information in general domains, but in specialized domains they cannot give users professional, in-depth information. This paper proposes a focused crawler based on the SVM classification algorithm as a solution to information retrieval in specialized domains. The crawler combines a topic relevance prediction algorithm based on page content and partial link information, the SVM classification algorithm, and the HITS algorithm. Experiments show that the crawling strategy based on SVM classification better distinguishes topic-related Web pages from topic-unrelated ones and improves the harvest rate and recall rate, which in turn improves the retrieval efficiency of search engines.

Key words: SVM, Focused crawler, Crawling strategy, HITS
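
The abstract names HITS as one component of the crawler but does not say how it is combined with the SVM classifier. For orientation only, here is a minimal sketch of the textbook HITS iteration (hub and authority scores over a link graph), not the authors' implementation. The function name `hits`, the `links` dictionary format, and the fixed iteration count are all assumptions made for illustration.

```python
import math


def hits(links, iterations=50):
    """Textbook HITS sketch (hypothetical helper, not from the paper).

    links: dict mapping a page to the list of pages it links to.
    Returns (hub, authority) score dicts, each L2-normalized.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        new_hub = {p: sum(auth[dst] for dst in links.get(p, ()))
                   for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Toy graph: pages "a" and "b" both link to "c"; "c" links to "d".
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": ["d"]})
    print(sorted(auth.items(), key=lambda kv: -kv[1]))
```

In a focused crawler, scores of this kind could be used to rank candidate links among pages a classifier has already judged relevant; whether the paper does so in this way is not stated in the abstract.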

