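Method for Fully Distributed Crawler Cluster Based on Kademlia

Computer Science ›› 2014, Vol. 41 ›› Issue (3): 124-128.

HUANG Zhi-min, ZENG Xue-wen and CHEN Jun

University of Chinese Academy of Sciences, Beijing 100049; National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190

Online: 2018-11-14 Published: 2018-11-14

Supported by:
This work was supported by the 863 Program major project "Development of a Converged Network Service System" (2011AA01A102), the Chinese Academy of Sciences Strategic Priority Research Program project "Real-Time Processing of Sea-Terminal Interaction Data" (XDA6030500), and the National Key Technology R&D Program project "Three-Screen Convergence Services for Key News Websites Supporting Enhanced Search" (2011BAH11B05).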

Abstract

Abstract: To address the efficiency, load balance, reliability and scalability problems that arise when organizing a massive number of crawler nodes into a fully distributed crawler cluster, this paper proposes a fully distributed crawler cluster method based on Kademlia. The method uses an improved Kademlia technique to establish the underlying communication mechanism between crawler nodes. On this basis, and using the XOR property of Kademlia together with each node's available resources, a fully distributed crawler cluster model is designed and implemented with task partitioning, exception handling, node join and exit handling, and load balancing. Experiments on a real network system show that the method can effectively use the computing, storage and bandwidth resources of massive numbers of weak computing terminals to build a fully distributed crawler cluster that is efficient, balanced, reliable and scalable to large sizes.

Key words: Kademlia, Distributed crawler, Weak computing terminal, Massive nodes, Structured P2P

CLC number: TP301.6 Document code: A
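The abstract states that task partitioning relies on Kademlia's XOR property. The following is a minimal Python sketch of that general idea only, not the authors' algorithm: each crawl key (here a host name) is hashed to a 160-bit ID and assigned to the node whose ID is closest under the XOR metric. The node names, host names, and the helper functions `kad_id`, `xor_distance` and `owner` are all hypothetical, and the resource-aware load balancing described in the abstract is not modeled.

```python
import hashlib
from typing import Sequence


def kad_id(text: str) -> int:
    """Map a string to a 160-bit integer ID via SHA-1 (Kademlia uses 160-bit IDs)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: bitwise XOR of two IDs, read as an integer."""
    return a ^ b


def owner(key: str, node_names: Sequence[str]) -> str:
    """Return the node whose ID is XOR-closest to the key's ID."""
    key_id = kad_id(key)
    return min(node_names, key=lambda n: xor_distance(kad_id(n), key_id))


# Hypothetical cluster and workload, for illustration only.
nodes = ["crawler-a", "crawler-b", "crawler-c", "crawler-d"]
hosts = [f"site{i}.example.com" for i in range(200)]

# Assignment with all nodes present.
before = {h: owner(h, nodes) for h in hosts}

# Simulate one node leaving and recompute the assignment.
remaining = [n for n in nodes if n != "crawler-c"]
after = {h: owner(h, remaining) for h in hosts}

# Only hosts that belonged to the departed node change owner.
moved = [h for h in hosts if before[h] != after[h]]
```

In this sketch a node's departure reassigns only the keys that node owned, because every other key keeps the same XOR-closest node. That locality is one reason an XOR-based partition suits node join and exit handling; how the paper combines it with available-resource information is not specified in the abstract.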

HUANG Zhi-min, ZENG Xue-wen and CHEN Jun. Method for Fully Distributed Crawler Cluster Based on Kademlia[J]. Computer Science, 2014, 41(3): 124-128. https://doi.org/

