计算机科学 ›› 2018, Vol. 45 ›› Issue (5): 303-309.doi: 10.11896/j.issn.1002-137X.2018.05.053

• 交叉与前沿 • 上一篇    下一篇

基于索引偏移的MapReduce聚类负载均衡策略

周华平,刘光宗,张贝贝   

  1. 安徽理工大学计算机科学与工程学院 安徽 淮南232000,安徽理工大学计算机科学与工程学院 安徽 淮南232000,安徽理工大学计算机科学与工程学院 安徽 淮南232000
  • 出版日期:2018-05-15 发布日期:2018-07-25
  • 基金资助:
    本文受国家自然科学基金(51174257),安徽理工大学矿业企业安全管理研究中心招标项目(SK2015A084),安徽省高校优秀青年人才支持计划项目资助

Load Balancing Strategy of MapReduce Clustering Based on Index Shift

ZHOU Hua-ping, LIU Guang-zong and ZHANG Bei-bei   

  • Online:2018-05-15 Published:2018-07-25

摘要: MapReduce作为一种分布式编程模型,被广泛应用于大规模和高维度数据集的处理中。其采用原始Hash函数 划分 数据,当数据分布不均匀时,常会出现数据倾斜的问题。基于MapReduce的聚类算法,需要多次迭代且不清楚各阶段Reduce的输入数据分布,因此现有的解决数据倾斜的方法并不适用。为解决数据划分的不均衡问题,提出一种当存在数据倾斜时更改剩余分区索引的策略。该方法在Map运行的过程中统计将要分给各reducer的数据量,由JobTrackcr监控全局的分区信息并根据数据倾斜模型动态修改原分区函数;在接下来的分区过程中,Partitioner把即将导致倾斜的分区索引到其余负载较轻的reducer上,使各节点的负载达到均衡。基于Zipf分布数据集和真实数据集,将所提算法与现有的解决数据倾斜的方法进行对比,结果证明,所提策略解决了MapReduce聚类中的数据倾斜问题,且在稳定性与执行时间上优于Hash和基于采样的动态分区法。

关键词: MapReduce,数据倾斜,负载均衡,分布式聚类,索引偏移

Abstract: MapReduce has been widely used in large-scale and high-dimension datasets as a kind of distributed programming model.Original Hash partition function in MapReduce often occurs data skew when data distribution is not uniform.In the clustering algorithm based on MapReduce,existing solutions for data skew are not applicable because the input data distribution of Reduce is unclear at each stage of multiple iteration.To solve the imbalance problem of data partitioning,this paper proposed a strategy to change the remaining partition index when data is tilted.In Map phase,the amount of data which will be distributed to each reducer is counted,then the global partition information is monitored and the original partition function is dynamically modified according to the data skew model by JobTrackcr,so the Partitioner can change the index of these partitions which will cause data skew to the other reducer that has less load in the next partitioning process,and eventually balance the load of each node.Finally,this method was compared with exi-sting methods on both synthetic datasets and real datasets.The experimental results show this strategy can solve data skew on MapReduce clustering with better stability and efficiency than Hash method and dynamic partitioning method based on sampling.

Key words: MapReduce,Data skew,Load balance,Distributed clustering,Index shift

[1] DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters[J].Communication of the ACM,2008,51(1):107-113.
[2] WANG S,WANG H J,QIN X P,et al.Architecting big data:challenges,studies and forecasts[J].Chinese Journal of Compu-ters,2011,4(10):1741-1752.(in Chinese) 王珊,王会举,覃雄派,等.架构大数据:挑战、现状与展望[J].计算机学报,2011,4(10):1741-1752.
[3] ZHANG C C.Design and Optimize Big-Data Join Algorithmsusing MapReduce[D].Anhui:University of Science and Techno-logy of China,2014.(in Chinese) 张常淳.基于MapReduce的大数据连接算法的设计与优化[D].安徽:中国科学技术大学,2014.
[4] LIN J.The curse of Zipf and limits to parallelization:A look at the stragglers problem in MapReduce[C]∥7th Workshop on Large-Scale Distributed Systems for Information Retrieval.Boston,USA,2009.
[5] GENG Y J.The Research of Skew with Counter Technique in MapReduce[D].Dalian:Dalian Maritime University,2013.(in Chinese) 耿玉娇.MapReduce中基于抽样技术的倾斜问题研究[D].大连:大连海事大学,2013.
[6] RACHA S C.Load Balancing Map-Reduce Communications for Efficient Executions of Applications in a Cloud[D].Bangalore:Indian Institute of Science, 2012.
[7] GUFLER B,AUGSTEN N,REISER A,Kemper A.Load balancing in MapReduce based on scalable cardinality estimates[C]∥the 2012 IEEE 28th International Conference on Data Enginee-ring(ICDE).Washington,USA,2012:522-533.
[8] GUFLER B,AUGSTEN N,REISER A,et al.Handling dataskew in MapReduce[C]∥1th International Conference on Cloud Computing and Service Science.Noordwijkerhout,Nether-lands,2011:574-583.
[9] ZHOU J S,WANG Q,GAO J.An approach for load balancing Computer in MapReduce via dynamic partitioning[J].Journal of Computer Research and Development,2013,0(S1):369-377.(in Chinese) 周家帅,王琦,高军.一种基于动态划分的MapReduce负载均衡方法[J].计算机研究与发展,2013,0(S1):369-377.
[10] WANG Z,CHEN Q,LI Z H,et al.An Incremental Partitioning Strategy for Data Balance on MapReduce[J].Chinese Journal of Computers,2016,9(1):19-35.(in Chinese) 王卓,陈群,李战怀,等.基于增量式分区策略的 MapReduce 数据均衡方法[J].计算机学报,2016,9(1):19-35.
[11] LI H C,QIN X L,SHEN X.Load Balancing Strategy Based on Pressure Feedback on MapReduce[J].Computer Science,2015,2(4):141-146.(in Chinese) 李航晨,秦小麟,沈尧.基于压力反馈的MapReduce负载均衡策略[J].计算机科学,2015,2(4):141-146.
[12] KWON Y,BALAZINSKA M,HOWE B,et al.Skew Tune:Mitigating skew in MapReduce applications[C]∥the 2012 ACM SIGMOD International Conference on Management of Data.Scottsdale,USA,2012:25-36.
[13] LI Q H.Research of Large-scale Data Mining Technologies on MapReduce[D].Shanghai:Fudan University,2013.(in Chinese) 李秋虹.基于MapReduce的大规模数据挖掘技术研究[D].上海:复旦大学,2013.
[14] FU J,DU Z H.Load Balancing Strategy on Peridical MapReduce Job[J].Computer Science,2013,40(3):38-40.(in Chinese) 傅杰,都志辉.一种周期性MapReduce作业的负载均衡策略[J].计算机科学,2013,40(3):38-40.
[15] XU Y J.The Research of Parallel Clustering Algorithm of Massive Data in Cloud Computing Environment[D].Dalian:Dalian Maritime University,2014.(in Chinese) 许玉杰.云计算环境下海量数据的并行聚类算法研究[D].大连:大连海事大学,2014.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!