Computer Science, 2019, Vol. 46, Issue (2): 215-222. doi: 10.11896/j.issn.1002-137X.2019.02.033

• Artificial Intelligence •

Focused Annealing Crawler Algorithm for Rainstorm Disasters Based on Comprehensive Priority and Host Information

LIU Jing-fa1,2, LI Fan1, JIANG Sheng-yi2

  1. School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044, China
  2. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
  • Received: 2018-07-12 Online: 2019-02-25 Published: 2019-02-25
  • Corresponding author: LIU Jing-fa (born 1972), male, Ph.D., professor, senior member of CCF. His main research interests include intelligent computing, web crawlers, ontology, and artificial intelligence. E-mail: jfliu@nuist.edu.cn
  • About the authors: LI Fan (born 1994), female, master's student; her main research interests include intelligent computing and web crawlers. JIANG Sheng-yi (born 1963), male, Ph.D., professor; his main research interests include data mining and natural language processing.
  • Funding:
    This work was supported by the Major Bidding Project of the National Social Science Foundation of China (16ZDA047), the National Natural Science Foundation of China (61373016), and the Natural Science Foundation of Jiangsu Province (BK20181409, BK20171458).

Abstract: The Internet now aggregates a wide variety of information related to rainstorm disasters, but searching webpages manually is inefficient, which makes focused web crawlers important. Building on the general web crawler, and in order to improve the accuracy of topic-relevance computation and prevent topic drift, this paper proposes a comprehensive priority evaluation method for hyperlinks that combines webpage content and link structure. The method jointly considers four factors: the topic relevance of the link's anchor text, the topic relevance of the webpages that contain the link, the PageRank (PR) value of the webpage that the link points to, and the topic relevance of that webpage. In addition, to keep the search from falling into local optima, a focused crawler algorithm that combines the crawler's memory of historical host information with simulated annealing (SA) is designed for the first time. Experiments with rainstorm disasters as the topic show that, for the same number of crawled webpages, the proposed algorithm retrieves more topic-relevant webpages than the breadth first search (BFS) strategy and the optimal priority search (OPS) strategy, and that the crawling accuracy is clearly improved.

Key words: Comprehensive priority, Host information, Rainstorm disasters, Simulated annealing algorithm, Web focused crawler
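
The abstract names four signals that feed the comprehensive priority of a hyperlink and an annealing step that lets the crawler escape local optima, but it does not give the formulas. The following is a minimal Python sketch of how such a scheme could look, not the authors' implementation. It assumes that the four signals are normalized to [0, 1] and combined as a weighted sum, that the equal weights are placeholders, and that link selection uses the standard Metropolis acceptance rule. The function names `comprehensive_priority`, `accept_probability`, and `pick_next_link` are hypothetical, and the historical host information is not modeled here.

```python
import math
import random


def comprehensive_priority(anchor_rel, parent_rel, pagerank, target_rel,
                           weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four signals named in the abstract.

    Assumption: all inputs are normalized to [0, 1]. The equal weights
    are placeholders, not values taken from the paper.
    """
    signals = (anchor_rel, parent_rel, pagerank, target_rel)
    return sum(w * s for w, s in zip(weights, signals))


def accept_probability(best_priority, candidate_priority, temperature):
    """Metropolis rule: a candidate at least as good as the best is always
    accepted; a worse one is accepted with probability exp(-gap / T)."""
    gap = best_priority - candidate_priority
    if gap <= 0:
        return 1.0
    return math.exp(-gap / temperature)


def pick_next_link(frontier, temperature, rng=random):
    """Pick the next URL from `frontier`, a dict mapping URL -> priority.

    A random candidate is drawn and tested with the Metropolis rule.
    On rejection the highest-priority link is returned (a simplification
    chosen for this sketch).
    """
    best_url = max(frontier, key=frontier.get)
    candidate = rng.choice(sorted(frontier))
    p = accept_probability(frontier[best_url], frontier[candidate], temperature)
    return candidate if rng.random() < p else best_url


if __name__ == "__main__":
    frontier = {
        "http://example.org/a": comprehensive_priority(0.9, 0.8, 0.4, 0.7),
        "http://example.org/b": comprehensive_priority(0.2, 0.3, 0.6, 0.1),
    }
    # A high temperature explores more; a low one approaches best-first search.
    print(pick_next_link(frontier, temperature=1.0))
    print(pick_next_link(frontier, temperature=1e-9))
```

In this sketch, lowering `temperature` as the crawl proceeds makes the selection converge toward a pure best-first strategy, while a higher initial temperature occasionally follows lower-priority links that a greedy crawler would never visit.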

CLC Number: 

  • TP391