Parallel Text Categorization of Massive Text Based on Hadoop

Computer Science, 2011, Vol. 38, Issue (10): 184-188.

XIANG Xiao-jun, GAO Yang, SHANG Lin, YANG Yu-bin

(Department of Computer Science and Technology, Nanjing University, Nanjing 210093)

Online: 2018-11-16  Published: 2018-11-16

Abstract

Abstract: Text categorization is a research hotspot and a core technique in information retrieval and data mining, and it has received wide attention and developed rapidly in recent years. As text data grows exponentially, managing it effectively requires efficient algorithms that run in a distributed environment. In this paper, we implement a simple and effective text categorization algorithm, the TFIDF classifier, on the Hadoop distributed platform. It is a classifier based on the vector space model and uses cosine similarity to obtain the classification result. Experiments on two datasets show that the parallel algorithm is effective on large datasets and can be applied well in practical settings.

Key words: Text categorization, Parallelization, Massive data, Hadoop
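
The abstract names a TFIDF classifier over the vector space model with cosine similarity but gives no further detail. As a rough, single-process illustration only, the sketch below builds one TF-IDF prototype vector per class and assigns a document to the class whose prototype is most similar. The function names (`train`, `cosine`, `classify`), the smoothed IDF formula, and the toy data are assumptions made for this example, not the authors' implementation. On Hadoop, the counting done in `train` would presumably be split into map and reduce phases, but the paper's actual job layout is not stated here.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build one TF-IDF prototype vector per class from (label, tokens) pairs."""
    class_tf = defaultdict(Counter)   # label -> term counts summed over its documents
    doc_freq = Counter()              # term -> number of documents containing it
    n_docs = 0
    for label, tokens in labeled_docs:
        n_docs += 1
        class_tf[label].update(tokens)
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumed variant) so every seen term gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    prototypes = {
        label: {t: tf * idf[t] for t, tf in counts.items()}
        for label, counts in class_tf.items()
    }
    return prototypes, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(tokens, prototypes, idf):
    """Return the label whose prototype has the highest cosine similarity."""
    # Terms unseen in training have no IDF and are ignored.
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


# Toy training data, invented for illustration.
training = [
    ("sports", "match team goal team".split()),
    ("sports", "team coach win".split()),
    ("tech", "cluster node hadoop job".split()),
    ("tech", "hadoop map reduce job".split()),
]
prototypes, idf = train(training)
predicted = classify("hadoop job on a cluster".split(), prototypes, idf)
```

With this toy data, `predicted` is `"tech"`, because the query shares terms only with the `tech` prototype. Per-class term counts and document frequencies are plain sums, which is what makes this kind of classifier a natural fit for a map and reduce decomposition.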

XIANG Xiao-jun, GAO Yang, SHANG Lin, YANG Yu-bin. Parallel Text Categorization of Massive Text Based on Hadoop[J]. Computer Science, 2011, 38(10): 184-188. https://doi.org/
