Computer Science ›› 2014, Vol. 41 ›› Issue (3): 124-128.


Method for Fully Distributed Crawler Cluster Based on Kademlia

HUANG Zhi-ming, ZENG Xue-weng and CHENG Jun

Online: 2018-11-14   Published: 2018-11-14

Abstract: To address the efficiency, load-balance, reliability, and scalability problems that arise when a large number of crawler nodes are organized into a fully distributed crawler cluster, we propose a fully distributed crawler cluster method based on Kademlia. The method builds the underlying communication mechanism between crawler nodes on an improved version of the Kademlia protocol. On this basis, we designed and implemented a distributed crawler cluster model that covers task partitioning, exception handling, node join and exit procedures, and load balancing, drawing on Kademlia's XOR metric and on the resources available at each node. Experiments on a real system show that the method can exploit the computing, storage, and bandwidth resources of a large number of weak terminals to build a fully distributed crawler cluster that is efficient, balanced, reliable, and scalable to large deployments.

Key words: Kademlia, Distributed crawler, Weak computing terminal, Massive nodes, Structured P2P
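
The abstract mentions task partitioning based on Kademlia's XOR metric without giving details. The following is only a minimal sketch of how such a partition could work, not the authors' implementation. It assumes 160-bit SHA-1 identifiers (as in the original Kademlia design) and assigns each crawl task to the node whose ID is closest to the task key under XOR. The names `make_id`, `xor_distance`, and `responsible_node` are hypothetical, and the resource-aware weighting described in the abstract is not modeled.

```python
import hashlib
from typing import List

ID_BITS = 160  # assumption: SHA-1 sized identifiers, as in the original Kademlia design


def make_id(text: str) -> int:
    """Hash a string (node name or task key such as a host name) to a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: the bitwise XOR of two IDs, read as an integer."""
    return a ^ b


def responsible_node(task_key: str, node_ids: List[int]) -> int:
    """Return the node ID closest to the task key under the XOR metric."""
    key = make_id(task_key)
    return min(node_ids, key=lambda n: xor_distance(n, key))


if __name__ == "__main__":
    # Hypothetical cluster of eight crawler nodes.
    nodes = [make_id(f"crawler-{i}") for i in range(8)]
    owner = responsible_node("example.com", nodes)
    print(f"example.com is crawled by node {owner:040x}")
```

One property of this kind of assignment is that when a node other than the owner leaves the cluster, the owner of a given task does not change, so only the departed node's tasks need to be reassigned.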

