Computer Science ›› 2018, Vol. 45 ›› Issue (5): 303-309.doi: 10.11896/j.issn.1002-137X.2018.05.053

Previous Articles     Next Articles

Load Balancing Strategy of MapReduce Clustering Based on Index Shift

ZHOU Hua-ping, LIU Guang-zong and ZHANG Bei-bei   

  • Online:2018-05-15 Published:2018-07-25

Abstract: MapReduce has been widely used in large-scale and high-dimension datasets as a kind of distributed programming model.Original Hash partition function in MapReduce often occurs data skew when data distribution is not uniform.In the clustering algorithm based on MapReduce,existing solutions for data skew are not applicable because the input data distribution of Reduce is unclear at each stage of multiple iteration.To solve the imbalance problem of data partitioning,this paper proposed a strategy to change the remaining partition index when data is tilted.In Map phase,the amount of data which will be distributed to each reducer is counted,then the global partition information is monitored and the original partition function is dynamically modified according to the data skew model by JobTrackcr,so the Partitioner can change the index of these partitions which will cause data skew to the other reducer that has less load in the next partitioning process,and eventually balance the load of each node.Finally,this method was compared with exi-sting methods on both synthetic datasets and real datasets.The experimental results show this strategy can solve data skew on MapReduce clustering with better stability and efficiency than Hash method and dynamic partitioning method based on sampling.

Key words: MapReduce,Data skew,Load balance,Distributed clustering,Index shift

[1] DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters[J].Communication of the ACM,2008,51(1):107-113.
[2] WANG S,WANG H J,QIN X P,et al.Architecting big data:challenges,studies and forecasts[J].Chinese Journal of Compu-ters,2011,4(10):1741-1752.(in Chinese) 王珊,王会举,覃雄派,等.架构大数据:挑战、现状与展望[J].计算机学报,2011,4(10):1741-1752.
[3] ZHANG C C.Design and Optimize Big-Data Join Algorithmsusing MapReduce[D].Anhui:University of Science and Techno-logy of China,2014.(in Chinese) 张常淳.基于MapReduce的大数据连接算法的设计与优化[D].安徽:中国科学技术大学,2014.
[4] LIN J.The curse of Zipf and limits to parallelization:A look at the stragglers problem in MapReduce[C]∥7th Workshop on Large-Scale Distributed Systems for Information Retrieval.Boston,USA,2009.
[5] GENG Y J.The Research of Skew with Counter Technique in MapReduce[D].Dalian:Dalian Maritime University,2013.(in Chinese) 耿玉娇.MapReduce中基于抽样技术的倾斜问题研究[D].大连:大连海事大学,2013.
[6] RACHA S C.Load Balancing Map-Reduce Communications for Efficient Executions of Applications in a Cloud[D].Bangalore:Indian Institute of Science, 2012.
[7] GUFLER B,AUGSTEN N,REISER A,Kemper A.Load balancing in MapReduce based on scalable cardinality estimates[C]∥the 2012 IEEE 28th International Conference on Data Enginee-ring(ICDE).Washington,USA,2012:522-533.
[8] GUFLER B,AUGSTEN N,REISER A,et al.Handling dataskew in MapReduce[C]∥1th International Conference on Cloud Computing and Service Science.Noordwijkerhout,Nether-lands,2011:574-583.
[9] ZHOU J S,WANG Q,GAO J.An approach for load balancing Computer in MapReduce via dynamic partitioning[J].Journal of Computer Research and Development,2013,0(S1):369-377.(in Chinese) 周家帅,王琦,高军.一种基于动态划分的MapReduce负载均衡方法[J].计算机研究与发展,2013,0(S1):369-377.
[10] WANG Z,CHEN Q,LI Z H,et al.An Incremental Partitioning Strategy for Data Balance on MapReduce[J].Chinese Journal of Computers,2016,9(1):19-35.(in Chinese) 王卓,陈群,李战怀,等.基于增量式分区策略的 MapReduce 数据均衡方法[J].计算机学报,2016,9(1):19-35.
[11] LI H C,QIN X L,SHEN X.Load Balancing Strategy Based on Pressure Feedback on MapReduce[J].Computer Science,2015,2(4):141-146.(in Chinese) 李航晨,秦小麟,沈尧.基于压力反馈的MapReduce负载均衡策略[J].计算机科学,2015,2(4):141-146.
[12] KWON Y,BALAZINSKA M,HOWE B,et al.Skew Tune:Mitigating skew in MapReduce applications[C]∥the 2012 ACM SIGMOD International Conference on Management of Data.Scottsdale,USA,2012:25-36.
[13] LI Q H.Research of Large-scale Data Mining Technologies on MapReduce[D].Shanghai:Fudan University,2013.(in Chinese) 李秋虹.基于MapReduce的大规模数据挖掘技术研究[D].上海:复旦大学,2013.
[14] FU J,DU Z H.Load Balancing Strategy on Peridical MapReduce Job[J].Computer Science,2013,40(3):38-40.(in Chinese) 傅杰,都志辉.一种周期性MapReduce作业的负载均衡策略[J].计算机科学,2013,40(3):38-40.
[15] XU Y J.The Research of Parallel Clustering Algorithm of Massive Data in Cloud Computing Environment[D].Dalian:Dalian Maritime University,2014.(in Chinese) 许玉杰.云计算环境下海量数据的并行聚类算法研究[D].大连:大连海事大学,2014.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!