Computer Science ›› 2012, Vol. 39 ›› Issue (Z11): 143-145.

• Software Engineering •

A Micro-blog Text Collection Platform Based on MapReduce

于留宝,胡长军,苏林晗   

  1. (School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083)
  • Published: 2018-11-16 Online: 2018-11-16

Micro-blogs Data Collection Based on MapReduce

  • Online:2018-11-16 Published:2018-11-16

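Abstract: Micro-blog data are not only large in volume but also highly time-sensitive, so conventional Web text crawling struggles to collect enough micro-blogs in a short time. To address the data collection problem faced by micro-blog research, this paper proposes a MapReduce-based micro-blog data collection platform. The entire crawling system is deployed on the Hadoop platform and exploits the characteristics of its distributed framework so that multiple nodes crawl micro-blogs simultaneously, which substantially increases the crawl rate. Because the input data for micro-blog collection are too small for Hadoop to balance load effectively, the paper also proposes supplying the input as multiple small files, which resolves the load imbalance. Finally, the system is tested using Sina micro-blog as an example. The results show that the system is low-cost, scalable, and efficient, and that it can serve as the basic data collection platform for micro-blog-based research in public opinion analysis, communication studies, and virtual sociology.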

Keywords: Hadoop, MapReduce, micro-blog, data collection, Sina

Abstract: Micro-blog data are both large in volume and highly time-sensitive, so it is difficult to obtain enough micro-blogs in a short period with traditional Web text crawling methods. To solve the data collection problem in micro-blog research, this paper presents a data collection platform based on MapReduce. The platform is deployed on Hadoop and takes full advantage of the distributed framework to crawl micro-blogs on multiple nodes at the same time, greatly improving the crawling rate. Because the input data for micro-blog collection are too small for Hadoop to balance load effectively, the paper supplies the input as a number of small files, which resolves the imbalance. Finally, the system is tested with Sina micro-blogs as an example. The results show that the system is low-cost, scalable, and efficient. It can serve as the basic data collection platform for micro-blog-based research in public opinion analysis, communication studies, and social networks.

Key words: Hadoop, MapReduce, Micro-blog, Data-collection, Sina
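
The multiple-small-files input mentioned in the abstract can be illustrated with a short sketch. With file-based input, Hadoop creates map tasks per input split, and a single tiny seed file fits in one split, so one node would do all the crawling. The Python below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the crawl seeds are user IDs stored one per line, and the function name `split_seeds`, the file naming scheme, and the demo IDs are all hypothetical.

```python
import os
import tempfile


def split_seeds(seeds, num_files, out_dir):
    """Write seeds round-robin into up to num_files small input files.

    Assumption: the job reads its input through a FileInputFormat-style
    input, where every input file yields at least one split and
    therefore at least one map task.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    # Round-robin slicing keeps file sizes within one seed of each other.
    buckets = [seeds[i::num_files] for i in range(num_files)]
    paths = []
    for index, bucket in enumerate(buckets):
        if not bucket:
            continue  # fewer seeds than files: do not write empty files
        path = os.path.join(out_dir, f"seeds-{index:05d}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(bucket) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical user IDs standing in for real crawl seeds.
    demo_seeds = [f"user{n}" for n in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        for p in split_seeds(demo_seeds, 4, tmp):
            with open(p, encoding="utf-8") as handle:
                print(os.path.basename(p), handle.read().split())
```

Uploading the resulting directory as the job input would give the scheduler several map tasks to place on different nodes. How many files to create relative to the cluster size is a tuning choice this sketch does not address.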
