Computer Science (计算机科学), 2019, Vol. 46, Issue 2: 215-222. doi: 10.11896/j.issn.1002-137X.2019.02.033
LIU Jing-fa (刘景发)1,2, LI Fan (李帆)1, JIANG Sheng-yi (蒋盛益)2
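Abstract: The Internet now aggregates a wide variety of information related to rainstorm disasters, but searching web pages manually is inefficient, which makes focused (topic-specific) web crawlers important. Building on a general-purpose web crawler, and in order to improve the accuracy of topic-relevance calculation and prevent topic drift, this paper proposes a comprehensive hyperlink priority evaluation method that combines web page content with link structure. The method jointly computes four quantities: the topic relevance of a link's anchor text, the topic relevance of the page containing the link, the PR (PageRank) value of the page the link points to, and the topic relevance of that target page. In addition, to address the tendency of the search to become trapped in local optima, the authors design, for the first time, a focused crawler algorithm that combines the crawler's memory of historical host information with simulated annealing. Crawling experiments on the topic of rainstorm disasters show that, for the same number of crawled pages, the proposed algorithm fetches more topic-relevant pages than the breadth-first search strategy (Breadth First Search, BFS) and the best-first search strategy (Optimal Priority Search, OPS), and that the crawler's accuracy improves markedly.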
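The two ideas named in the abstract, a combined hyperlink priority and a simulated-annealing step that can escape local optima, can be illustrated with a minimal Python sketch. This is not the authors' formula or algorithm: the abstract does not say how the four signals are combined, so the linear weighted sum, the `WEIGHTS` values, the `Link` fields, and the helper names `priority`, `accept`, and `choose_next` are all assumptions made for illustration. The crawler's memory of historical host information is not modeled here.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Link:
    url: str
    anchor_relevance: float  # topic relevance of the link's anchor text, in [0, 1]
    parent_relevance: float  # topic relevance of the page containing the link
    target_pagerank: float   # PR value of the linked page, assumed normalized to [0, 1]
    target_relevance: float  # estimated topic relevance of the linked page


# Hypothetical weights: the abstract does not state how the signals are combined.
WEIGHTS = (0.3, 0.2, 0.2, 0.3)


def priority(link, weights=WEIGHTS):
    """Assumed linear combination of the four signals named in the abstract."""
    w1, w2, w3, w4 = weights
    return (w1 * link.anchor_relevance + w2 * link.parent_relevance
            + w3 * link.target_pagerank + w4 * link.target_relevance)


def accept(current_score, candidate_score, temperature, rng=random):
    """Metropolis rule: always accept an improvement; accept a worse
    candidate with probability exp(delta / temperature)."""
    delta = candidate_score - current_score
    if delta >= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(delta / temperature)


def choose_next(frontier, current_score, temperature, rng=random):
    """Draw a random frontier link; if the Metropolis rule rejects it,
    fall back to the highest-priority link (plain best-first behavior)."""
    candidate = rng.choice(frontier)
    if accept(current_score, priority(candidate), temperature, rng):
        return candidate
    return max(frontier, key=priority)
```

At a high temperature the sketch often follows lower-priority links, which is what lets a search leave a locally optimal region; as the temperature approaches zero it degenerates to best-first selection.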