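Abstract: Organizing a massive number of crawler nodes into a fully distributed crawler cluster raises problems of efficiency, load balance, reliability, and scalability. To address them, this paper proposes a Kademlia-based method for building a fully distributed crawler cluster. The method uses an improved Kademlia technique to establish the underlying communication mechanism among crawler nodes. On this basis, drawing on Kademlia's XOR property and each node's available resources, a fully distributed crawler cluster model is designed and implemented that supports task partitioning, exception handling, handling of node joins and departures, and load balancing. Experimental results on a real network system show that the method can effectively use the computing, storage, and bandwidth resources of a large number of weak computing terminals to build an efficient, balanced, reliable, and highly scalable fully distributed crawler cluster.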
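The XOR-based task partitioning mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not describe the improved Kademlia protocol or the resource-aware weighting, so the example only shows the standard Kademlia idea of assigning a key to the node whose ID is closest under the XOR metric. The names `kad_id`, `xor_distance`, and `assign_crawler`, the use of SHA-1 over host names, and the `crawler-i` node labels are all assumptions made for illustration.

```python
import hashlib

def kad_id(key: str) -> int:
    """Map a string to a 160-bit integer ID (assumption: SHA-1, as in classic Kademlia)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two IDs: bitwise XOR read as an integer."""
    return a ^ b

def assign_crawler(url_host: str, node_ids: list) -> int:
    """Return the node ID closest to the host's ID under the XOR metric."""
    target = kad_id(url_host)
    return min(node_ids, key=lambda n: xor_distance(n, target))

# Hypothetical cluster of eight crawler nodes.
nodes = [kad_id(f"crawler-{i}") for i in range(8)]
owner = assign_crawler("example.com", nodes)

# When a node that does not own the host leaves, the assignment is unchanged;
# only the tasks of the departed node need to be reassigned.
remaining = [n for n in nodes if n != owner][1:] + [owner]
owner_after_leave = assign_crawler("example.com", remaining)
```

Because each host maps to exactly one closest node, every crawler can decide locally who is responsible for a URL without a central coordinator. A resource-aware variant, as the abstract suggests, would additionally have to take each node's available capacity into account.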