Computer Science, 2015, Vol. 42, Issue (2): 118-122. doi: 10.11896/j.issn.1002-137X.2015.02.025

• Information Security •

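Research on Focused Crawling Technology Based on SVM

LI Lu, ZHANG Guo-yin and LI Zheng-wen

  1. Laboratory of the Military Industry Secrecy Qualification Review and Certification Center, Beijing 100089; College of Computer Science and Technology, Harbin Engineering University, Harbin 150001; College of Computer Science and Technology, Harbin Engineering University, Harbin 150001
  • Online: 2018-11-14 Published: 2018-11-14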

Abstract: With the rapid development of the Internet, online information has become massive and diverse. Providing users with the information they need quickly and accurately has become the primary challenge for search engines. Traditional general-purpose search engines can retrieve targets across a broad range of information, but in certain specialized domains they cannot give users professional, in-depth information. This paper proposes a focused crawling technique based on SVM classification, which combines a topic relevance prediction algorithm based on text content and partial link information, the SVM classification algorithm, and the HITS algorithm to address domain-specific information retrieval. Experimental results show that a crawling strategy based on the SVM classification algorithm distinguishes topic-relevant pages from irrelevant ones more effectively and improves the harvest rate and recall rate of topic-relevant pages, which in turn improves the retrieval efficiency of the search engine.

Key words: SVM, Focused crawler, Crawling strategy, HITS
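
The abstract names HITS as one of the three components but does not say how it is wired into the crawler. As a rough illustration only, the sketch below shows the textbook HITS iteration (mutually reinforcing hub and authority scores with L2 normalization) on a small link graph. The function name `hits`, the `links` dictionary layout, and the iteration count are assumptions made for this example and are not taken from the paper.

```python
from math import sqrt


def hits(links, iterations=50):
    """Textbook HITS on a link graph (illustrative, not the authors' code).

    links: dict mapping a page to the list of pages it links to.
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        new_hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


# Hypothetical three-page graph: "a" and "b" both link to "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
```

In this toy graph, "c" ends up with all of the authority score, while "a" and "b" share the hub score equally. A focused crawler could use such scores to prioritize links found on strong hubs, though whether the paper does so in this way is not stated in the abstract.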

