Computer Science, 2016, Vol. 43, Issue (8): 254-257. doi: 10.11896/j.issn.1002-137X.2016.08.051

• Artificial Intelligence •

Research on Adaptive Genetic Algorithm in Application of Focused Crawler Search Strategy

JING Wen-peng, WANG Yu-jian and DONG Wei-wei

  1. College of Information, Beijing Union University, Beijing 100101
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China project "Research on Image Semi-structuring Based on Hypergraph XGML" (61271369).

Abstract: Improving coverage and accuracy is an active research topic for focused crawlers. Most existing focused crawlers use a best-first search strategy, which tends to become trapped in local optima. To address this weakness, we design a focused crawler search strategy that incorporates a genetic algorithm, with a dynamic fitness function and genetic operators that give the crawler a degree of self-adaptation. Compared with other search strategies and with a search strategy based on a non-adaptive genetic algorithm, the experimental results show that the proposed algorithm improves crawler performance to some extent.

Key words: Focused crawler, Importance degree, Genetic algorithm, Genetic operators, Fitness function
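
The abstract does not give the form of the adaptive genetic operators. As a rough illustration only, the sketch below shows one widely used way to make crossover and mutation probabilities adaptive, in the style of Srinivas and Patnaik (1994): individuals with above-average fitness receive lower rates so that good solutions are preserved, while below-average individuals keep the maximum rates to maintain exploration. The function name `adaptive_rates`, the constants `k1` to `k4`, and the reading of fitness as a link's topical relevance score are assumptions made for this example, not details taken from the paper.

```python
def adaptive_rates(f_pair, f_ind, f_max, f_avg, k1=1.0, k2=0.5, k3=1.0, k4=0.5):
    """Return (crossover_prob, mutation_prob) for a maximization problem.

    f_pair: the larger fitness of the two parents chosen for crossover
    f_ind:  fitness of the individual considered for mutation
    f_max:  best fitness in the current population
    f_avg:  mean fitness of the current population
    k1..k4: illustrative constants (assumed here, not from the paper)
    """
    spread = f_max - f_avg
    if spread <= 0:
        # The population has converged: use the maximum rates to
        # reintroduce diversity and help escape a local optimum.
        return k3, k4
    # Above-average individuals are disrupted less as they approach the best.
    pc = k1 * (f_max - f_pair) / spread if f_pair >= f_avg else k3
    pm = k2 * (f_max - f_ind) / spread if f_ind >= f_avg else k4
    return pc, pm


if __name__ == "__main__":
    # Hypothetical relevance scores for candidate links in the crawl frontier.
    scores = [0.2, 0.5, 0.75, 1.0]
    f_max, f_avg = max(scores), sum(scores) / len(scores)
    for s in scores:
        print(s, adaptive_rates(s, s, f_max, f_avg))
```

With this scheme the best individual in the population gets both rates set to zero and is carried forward unchanged, whereas a fully converged population falls back to the maximum rates. That fallback is the behaviour intended to counter the local-optimum problem the abstract attributes to best-first search.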

