Computer Science ›› 2018, Vol. 45 ›› Issue (6A): 428-432.

• Big Data & Data Mining

Study on Active Acquisition of Distributed Web Crawler Cluster

DONG Yu-long, YANG Lian-he, MA Xin

  1. School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China
  • Online: 2018-06-20 Published: 2018-08-03

Abstract: To address the problems of processing efficiency, scalability, task allocation, and load balancing in existing distributed web crawler methods, this paper proposes a distributed web crawler method based on active task acquisition. A sub-control module is added to each sub-node to evaluate the node's load and operating status and to apply to the central control node for a task queue. Based on this method and a dynamic dual-directional priority task allocation algorithm, a distributed web crawler model is designed that features load balancing, hierarchical task allocation, intelligent identification of abnormal nodes, and safe exit. Practical tests show that the active task acquisition method can be used to build large-scale distributed crawler clusters effectively.

Key words: Active acquisition, Crawler framework, Distributed crawler, Dynamic priority, Load balancing, Multi-process

CLC Number: 

  • TP301.6
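
The active acquisition idea in the abstract, where each sub-node judges its own load and status and then applies to the central control node for tasks, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `CentralController` and `WorkerNode`, the free-slot load metric, the `healthy` flag, and the static "lower number first" priority are all assumptions made for illustration. In particular, the paper's dynamic dual-directional priority algorithm is replaced here by a plain priority heap.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # assumption: lower value is served first
    url: str = field(compare=False)    # excluded from ordering


class CentralController:
    """Holds the global task queue and hands out work only when a node asks."""

    def __init__(self, tasks):
        self._heap = list(tasks)
        heapq.heapify(self._heap)

    def request_tasks(self, max_count):
        """Return up to max_count of the highest-priority pending tasks."""
        batch = []
        while self._heap and len(batch) < max_count:
            batch.append(heapq.heappop(self._heap))
        return batch

    def pending(self):
        return len(self._heap)


class WorkerNode:
    """Sub-node whose sub-control logic sizes its own task requests."""

    def __init__(self, name, capacity, healthy=True):
        self.name = name
        self.capacity = capacity       # max tasks the node holds at once
        self.healthy = healthy         # stand-in for an operating-status check
        self.local_queue = []

    def spare_capacity(self):
        """Hypothetical load metric: free slots in the local queue."""
        return max(0, self.capacity - len(self.local_queue))

    def acquire(self, controller):
        """Apply for tasks only if the node is healthy and has free slots."""
        if not self.healthy:
            return 0
        batch = controller.request_tasks(self.spare_capacity())
        self.local_queue.extend(batch)
        return len(batch)
```

Because each node pulls only as many tasks as it has room for, a node with more capacity receives more work and an unhealthy node receives none, so the controller needs no per-node load bookkeeping in this sketch. A real deployment would replace the free-slot count with measured CPU, memory, or bandwidth load.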