Computer Science ›› 2012, Vol. 39 ›› Issue (Z11): 143-145.

Micro-blogs Data Collection Based on MapReduce

  

  • Online: 2018-11-16 Published: 2018-11-16

Abstract: Micro-blog data is both large in volume and highly time-sensitive, so traditional Web text crawling methods struggle to collect enough micro-blogs in a short period of time. To address this data collection problem in micro-blog research, this paper presents a data collection platform based on MapReduce and built on Hadoop. The platform exploits Hadoop's distributed framework to crawl micro-blogs on multiple nodes simultaneously, which greatly improves the crawling rate. Because the input data for micro-blog collection is too small for Hadoop to balance the load effectively, the paper proposes supplying the input as a number of small files. Finally, the system is tested using Sina micro-blogs as an example. The results show that the system is low-cost, scalable, and high-performing. It can serve as the basic data collection platform for public opinion analysis, communication research, and social network research based on micro-blog data.

Key words: Hadoop, MapReduce, Micro-blog, Data-collection, Sina
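
The small-files idea in the abstract can be illustrated with a minimal sketch. It assumes the crawl input is a list of seed identifiers (for example account IDs) and that the job uses a file-based Hadoop input format in which each input file yields at least one split, so that each small file is handled by its own map task. The abstract does not give the paper's actual scheme; the function names, file naming, and round-robin partitioning below are hypothetical choices for illustration only.

```python
import os
import tempfile


def partition_seeds(seeds, num_files):
    """Split seeds round-robin into at most num_files non-empty groups."""
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    groups = [seeds[i::num_files] for i in range(num_files)]
    # Drop empty groups so no empty input file is produced.
    return [group for group in groups if group]


def write_seed_files(seeds, num_files, out_dir):
    """Write each group to its own small text file, one seed per line."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, group in enumerate(partition_seeds(seeds, num_files)):
        # Hypothetical naming scheme; any unique file name would do.
        path = os.path.join(out_dir, f"seeds-{index:05d}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(str(seed) for seed in group) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo_seeds = [f"user{n}" for n in range(10)]  # hypothetical account IDs
    with tempfile.TemporaryDirectory() as tmp:
        for seed_path in write_seed_files(demo_seeds, 4, tmp):
            print(os.path.basename(seed_path))
```

The resulting directory would then be uploaded to HDFS and used as the job's input path. Round-robin assignment keeps the files close to equal in size, which is one simple way to spread a small seed list evenly across nodes.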
