Computer Science ›› 2011, Vol. 38 ›› Issue (10): 166-168.

• Database and Data Mining •

Research on the Design of a Parallel k-means Clustering Algorithm Based on the Cloud Computing Platform Hadoop

赵卫中,马慧芳,傅燕翔,史忠植   

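  1. (College of Information Engineering, Xiangtan University, Xiangtan 411105); (College of Mathematics and Information Science, Northwest Normal University, Lanzhou 730070); (College of Mechanical Engineering, Xiangtan University, Xiangtan 411105); (Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
  • Publication date: 2018-11-16 Release date: 2018-11-16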

Research on Parallel k-means Algorithm Design Based on Hadoop Platform

ZHAO Wei-zhong, MA Hui-fang, FU Yan-xiang, SHI Zhong-zhi

  • Online:2018-11-16 Published:2018-11-16

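Abstract: With the development of database technology and the rapid spread of the Internet, the volume of data that practical applications must process has grown sharply, so clustering research now faces many new problems and challenges, such as massive data and new computing environments. This paper studies in depth a parallel k-means clustering algorithm based on the cloud computing platform Hadoop and presents the methods and strategies used in its design. Experiments on several datasets of different sizes show that the designed parallel clustering algorithm performs well in terms of speedup, scaleup and sizeup, and is suitable for analyzing and mining massive data.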

Keywords: Cloud computing, Hadoop platform, Parallel k-means, MapReduce

Abstract: Over the past decades, data clustering has been studied extensively, and many methods and theories have been developed. However, with the development of database technology and the growing popularity of the Internet, research on data clustering faces many new challenges, such as massive data and new computing environments. We study in depth a parallel k-means algorithm based on Hadoop, a new cloud computing platform, and show how to design parallel k-means algorithms on Hadoop. Experiments on datasets of different sizes demonstrate that the proposed algorithm performs well in terms of speedup, scaleup and sizeup. It is therefore well suited to clustering very large datasets.

Key words: Cloud computing, Hadoop, Parallel k-means, MapReduce
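
The abstract does not spell out the algorithm design, so the following is only a minimal sketch of a commonly used MapReduce formulation of one k-means iteration, simulated in plain Python rather than run on Hadoop. The function names (`map_assign`, `combine`, `reduce_centroid`, `kmeans_iteration`), the representation of data splits as Python lists, and the use of a combine step are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step (assumed): emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def combine(pairs):
    """Combine step (assumed): per-cluster partial sums and counts for one split."""
    partial = {}
    for key, (vec, count) in pairs:
        if key in partial:
            acc, n = partial[key]
            partial[key] = ([a + v for a, v in zip(acc, vec)], n + count)
        else:
            partial[key] = (list(vec), count)
    return list(partial.items())


def reduce_centroid(values):
    """Reduce step (assumed): merge partial sums for one cluster into a centroid."""
    total = [0.0] * len(values[0][0])
    n = 0
    for vec, count in values:
        total = [t + v for t, v in zip(total, vec)]
        n += count
    return tuple(t / n for t in total)


def kmeans_iteration(splits, centroids):
    """Simulate one MapReduce round; each element of `splits` plays one mapper's input."""
    shuffled = defaultdict(list)
    for split in splits:
        mapped = [map_assign(p, centroids) for p in split]
        for key, value in combine(mapped):
            shuffled[key].append(value)  # stands in for the shuffle phase
    # A cluster that received no points keeps its previous centroid.
    return [
        reduce_centroid(shuffled[i]) if i in shuffled else centroids[i]
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_iteration(splits, [(0.0, 0.0), (10.0, 10.0)]))
```

In this sketch a driver would call `kmeans_iteration` repeatedly, feeding each round's centroids into the next, until the centroids stop changing or an iteration limit is reached. The combine step only reduces the amount of data passed to the reducers; dropping it would give the same centroids.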
