Computer Science, 2018, Vol. 45, Issue (6A): 428-432.

• Big Data and Data Mining •

Study on Active Acquisition of Distributed Web Crawler Cluster

DONG Yu-long, YANG Lian-he, MA Xin

  1. School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China
  • Online: 2018-06-20   Published: 2018-08-03
  • About the author: DONG Yu-long (born 1991), male, master's student. His main research interests are data mining and distributed systems. E-mail: dongyulong1991@sina.com.

Abstract: To address the problems that current distributed web crawler methods face in processing efficiency, scalability, reliability, task allocation, and load balancing, this paper proposes a distributed web crawler method in which nodes actively acquire tasks. The method adds a sub-control module to each child node. This module evaluates the node's load and running status and actively requests task queues from the central control node. On this basis, combined with a dynamic bidirectional priority task allocation algorithm, a distributed web crawler model is designed that features load balancing, tiered task allocation, rapid identification of abnormal nodes, and safe node exit. Practical tests show that the active-acquisition method can effectively use general-purpose platforms to build large-scale distributed crawler clusters.

Key words: Active acquisition, Crawler framework, Distributed crawler, Dynamic priority, Load balancing, Multi-process
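
The pull-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `CentralController` and `SubController`, the `capacity` parameter, and the rule "request as many tasks as there are spare slots" are all assumptions made for illustration, and a plain priority heap stands in for the dynamic bidirectional priority algorithm, whose details are not given on this page.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # lower value is served first (assumption)
    url: str = field(compare=False)    # excluded from ordering comparisons


class CentralController:
    """Holds pending tasks and hands them out only when a node asks."""

    def __init__(self, tasks):
        self._heap = list(tasks)
        heapq.heapify(self._heap)

    def request_tasks(self, max_count):
        """Return up to max_count tasks in priority order."""
        batch = []
        while self._heap and len(batch) < max_count:
            batch.append(heapq.heappop(self._heap))
        return batch

    def pending(self):
        return len(self._heap)


class SubController:
    """Runs on a worker node: estimates its load, then pulls work it can hold."""

    def __init__(self, center, capacity):
        self.center = center
        self.capacity = capacity       # hypothetical limit on queued tasks
        self.queue = []

    def spare_slots(self):
        # Load is modelled here simply as the local queue length.
        return max(0, self.capacity - len(self.queue))

    def pull(self):
        """Ask the central node for as many tasks as there are spare slots."""
        slots = self.spare_slots()
        if slots == 0:
            return 0                   # node is saturated, so it does not request
        batch = self.center.request_tasks(slots)
        self.queue.extend(batch)
        return len(batch)


center = CentralController(
    Task(p, f"http://example.com/{p}") for p in [3, 1, 2, 5, 4]
)
small = SubController(center, capacity=2)
large = SubController(center, capacity=4)
small.pull()
large.pull()
```

Because each node asks only for what it has room for, a node with less capacity ends up with fewer tasks without the central node having to track per-node load. A real cluster would replace the queue-length measure with CPU, memory, or bandwidth readings.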

CLC Number:

  • TP301.6