Computer Science ›› 2019, Vol. 46 ›› Issue (2): 215-222. doi: 10.11896/j.issn.1002-137X.2019.02.033

• Artificial Intelligence •

Focused Annealing Crawler Algorithm for Rainstorm Disasters Based on Comprehensive Priority and Host Information

LIU Jing-fa1,2, LI Fan1, JIANG Sheng-yi2   

  1. School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044, China
  2. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
  • Received:2018-07-12 Online:2019-02-25 Published:2019-02-25

Abstract: The Internet holds a large amount of information related to rainstorm disasters, but searching for it manually is inefficient, which makes focused web crawlers important. Building on the general web crawler, this paper proposes a comprehensive priority evaluation method for hyperlinks, based on webpage content and link structure, to improve the precision of webpage topic-relevance computation and to prevent topic drift. The priority combines four components: the topic relevance of the anchor text, the topic relevance of all webpages that contain the link, the PageRank (PR) value, and the topic relevance of the webpage the link points to. To keep the search from becoming trapped in a local optimum, the paper also designs a new focused crawler algorithm that combines memorized historical host information with the simulated annealing (SA) algorithm; the authors describe this combination as the first of its kind. Experiments with a focused crawler for rainstorm disasters show that the proposed algorithm outperforms the breadth-first search (BFS) strategy and the optimal priority search (OPS) strategy, and significantly improves crawling accuracy.

Key words: Comprehensive priority, Host information, Rainstorm disasters, Simulated annealing algorithm, Web focused crawler

CLC Number: 

  • TP391
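
The two ideas named in the abstract, a four-part link priority and a simulated-annealing acceptance step, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the equal weights, the assumption that all four scores are already computed and normalized to [0, 1], the function names `comprehensive_priority` and `accept_link`, and the Metropolis-style acceptance rule are illustrative assumptions, and the use of historical host information is omitted because the abstract does not say how it enters the algorithm.

```python
import math
import random


def comprehensive_priority(anchor_rel, parent_rel, pagerank, target_rel,
                           weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four signals from the abstract into one link priority.

    All inputs are assumed to lie in [0, 1]. The equal weights are an
    assumption for illustration; the abstract does not give the real ones.
    """
    parts = (anchor_rel, parent_rel, pagerank, target_rel)
    return sum(w * p for w, p in zip(weights, parts))


def accept_link(current_priority, candidate_priority, temperature, rng=random):
    """Metropolis-style acceptance for a maximization problem (assumed rule).

    A better candidate is always accepted. A worse one is accepted with
    probability exp(delta / temperature), which lets the crawler leave a
    locally optimal region while the temperature is still high.
    """
    delta = candidate_priority - current_priority
    if delta >= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(delta / temperature)


if __name__ == "__main__":
    current = comprehensive_priority(0.9, 0.7, 0.3, 0.8)
    candidate = comprehensive_priority(0.8, 0.6, 0.2, 0.4)
    temperature = 1.0
    for step in range(5):
        decision = accept_link(current, candidate, temperature)
        print(f"step={step} T={temperature:.3f} accept={decision}")
        temperature *= 0.5  # geometric cooling schedule, also an assumption
```

As the temperature falls, lower-priority links are accepted less often, so the crawl gradually shifts from exploration toward a greedy, best-first ordering.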