Computer Science, 2018, Vol. 45, Issue 6A: 428-432.
DONG Yu-long (董禹龙), YANG Lian-he (杨连贺), MA Xin (马欣)
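Abstract: Current distributed web crawler methods face problems in processing efficiency, scalability, reliability, task allocation, and load balancing. To address them, this paper proposes a distributed web crawler method based on active task acquisition. The method adds a sub-control module to each worker node; the module evaluates the node's load and running state and actively requests task queues from the central control node. On this basis, combined with a dynamic bidirectional priority task allocation algorithm, the paper designs a distributed web crawler model that provides load balancing, graded task allocation, rapid detection of node anomalies, and safe node exit. Practical tests show that this active-acquisition method can effectively use general-purpose platforms to build large distributed crawler clusters.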
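The pull-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `CentralNode` and `WorkerNode`, the `capacity` and `healthy` fields, and the rule "request as many tasks as there are free queue slots" are all assumptions made for illustration, and the dynamic bidirectional priority allocation algorithm is not modeled here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CentralNode:
    """Hypothetical central control node holding the pending URL queue."""
    pending: deque = field(default_factory=deque)

    def request_tasks(self, max_tasks: int) -> list:
        # Hand out at most max_tasks URLs; the worker decides how many to ask for.
        batch = []
        while self.pending and len(batch) < max_tasks:
            batch.append(self.pending.popleft())
        return batch


@dataclass
class WorkerNode:
    """Hypothetical worker whose sub-control module sizes its own requests."""
    capacity: int                 # assumed maximum length of the local queue
    healthy: bool = True          # assumed outcome of a local health check
    queue: deque = field(default_factory=deque)

    def spare_capacity(self) -> int:
        # Load evaluation (assumed rule): an unhealthy node asks for nothing,
        # a healthy node asks for as many tasks as it has free slots.
        if not self.healthy:
            return 0
        return max(0, self.capacity - len(self.queue))

    def pull(self, central: CentralNode) -> int:
        # The worker initiates the request; the central node never pushes.
        wanted = self.spare_capacity()
        if wanted == 0:
            return 0
        batch = central.request_tasks(wanted)
        self.queue.extend(batch)
        return len(batch)


if __name__ == "__main__":
    central = CentralNode(deque(f"http://example.com/{i}" for i in range(10)))
    fast = WorkerNode(capacity=6)
    slow = WorkerNode(capacity=2)
    sick = WorkerNode(capacity=6, healthy=False)
    for name, node in [("fast", fast), ("slow", slow), ("sick", sick)]:
        print(name, "received", node.pull(central), "tasks")
```

Because each worker sizes its own request, a node with less free capacity or a failed health check takes on less work without the central node having to track per-node load. That is one plausible reading of how active acquisition supports load balancing.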