Computer Science, 2018, Vol. 45, Issue (6A): 471-475.

• Big Data & Data Mining •

Hash Join in MapReduce Distributed Environment Based on Column-store

ZHANG Bin1, LE Jia-jin2

  1. Zhejiang University of Finance & Economics, Hangzhou 310018, China
  2. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Online: 2018-06-20 Published: 2018-08-03

Abstract: Big data is characterized by volume, variety, value, and velocity, as well as by commodity hardware and open-source software. To address the inefficiency and limited scalability of traditional relational databases in big data analysis, this paper presents a hash join algorithm for a MapReduce distributed environment based on column-store, built on the MapReduce computing model. First, it proposes the design of a distributed computing model oriented to large data. Then, it proposes partition aggregation and a heuristic optimization strategy to implement the hash join algorithm. Finally, experiments evaluate execution time and load capacity. The results show that the proposed method is effective and can provide good scalability in big data analysis.

Key words: Big data, Column-store, Hash join, MapReduce, Parallel computing

CLC Number: TP311
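
The abstract names a hash join executed in a MapReduce setting over column-stored data but gives no algorithmic detail. As a rough illustration only, the sketch below simulates a generic repartition hash join in a single process: a map step tags values read from key and payload columns, a shuffle step hash-partitions them by join key, and a reduce step builds and probes an in-memory hash table per partition. It is not the paper's algorithm; the partition aggregation and heuristic optimization strategy are not reproduced, and all function names, table tags, and sample data are hypothetical.

```python
from collections import defaultdict


def map_phase(table_tag, key_column, payload_column):
    """Emit (join_key, (table_tag, payload)) pairs from two column arrays."""
    for key, payload in zip(key_column, payload_column):
        yield key, (table_tag, payload)


def shuffle(pairs, num_partitions):
    """Hash-partition map output so equal keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def reduce_phase(partition, build_tag, probe_tag):
    """Build a hash table on the build side, then probe it with the other side."""
    hash_table = defaultdict(list)
    for key, (tag, payload) in partition:
        if tag == build_tag:
            hash_table[key].append(payload)
    for key, (tag, payload) in partition:
        if tag == probe_tag:
            for build_payload in hash_table.get(key, []):
                yield key, build_payload, payload


def hash_join(build_columns, probe_columns, num_partitions=4):
    """Join two tables, each given as a (key_column, payload_column) pair."""
    pairs = list(map_phase("build", *build_columns))
    pairs += list(map_phase("probe", *probe_columns))
    results = []
    for partition in shuffle(pairs, num_partitions):
        results.extend(reduce_phase(partition, "build", "probe"))
    return sorted(results)


# Hypothetical sample data: only the columns needed by the join are read,
# which is the access pattern a column-store makes cheap.
customer_columns = ([1, 2, 3], ["a", "b", "c"])
order_columns = ([1, 1, 3, 4], [10, 20, 30, 40])
joined = hash_join(customer_columns, order_columns)
```

In a real MapReduce deployment the shuffle is performed by the framework and each partition is reduced on a separate node; here the partitions are processed in a loop purely to show the data flow.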