基于列存储的MapReduce分布式Hash连接算法

摘要/Abstract

摘要： 大数据具有规模大、深度大、宽度大、处理时间短、硬件系统普通化、软件系统开源化的特点。传统关系型数据库在对大数据进行操作时存在系统性能严重下降、计算效率提升有限以及可扩展性差等问题,因此引入MapReduce并行计算模型,提出一种大数据上基于列存储的MapReduce分布式Hash连接算法。首先,设计面向大数据的分布式计算模型,在设计的分片聚集并行连接的基础上,利用Hash连接以及动态探测方法优化了数据并行连接处理效率;然后,针对该算法开发了基于Hadoop的原型系统。通过实验证明,在大数据分析处理中,所提算法在执行时间和负载能力上都有很好的性能表现,也能提供良好的可扩展性。

关键词: Hash连接, MapReduce, 并行计算, 大数据, 列存储

Abstract: The characters of big data are volume,variety,value,velocity,and common hardware and open source.Aiming at the system inefficiency and limited scalability of traditional relational database in big data analysis,this paper presented an algorithm of Hash joins in MapReduce distributed environment based on column-store by introducing MapReduce computing model.First of all,this paper proposed the design of large data-oriented distributed computing models.Then,it proposed the partition aggregation and the heuristic optimization strategy to realize the implementation of Hash join algorithm.Lastly,the experiments evaluated execution time and load capacity.The results show that the proposed method is effective and can provid good scalability in big data analysis.

Key words: Big data, Column-store, Hash join, MapReduce, Parallel computing

中图分类号:

TP311

张滨, 乐嘉锦. 基于列存储的MapReduce分布式Hash连接算法[J]. 计算机科学, 2018, 45(6A): 471-475. https://doi.org/

ZHANG Bin, LE Jia-jin. Hash Join in MapReduce Distributed Environment Based on Column-store[J]. Computer Science, 2018, 45(6A): 471-475. https://doi.org/

参考文献

[1]DEAN J,GHEMAWAT S.MapReduce:Simplified Data Pro- cessing on Large Clusters.
[C]∥6th OSDI.San Francisco:USENIX Association,2004:137-150.
[2]ABADI D J,MADDEN S R,HACHEM N.Column-Stores vs.Row-Stores:How Different Are They Really? [C]∥The 2008 ACM SIGMOD Int Conf.Vancouver,BC,Canada:ACM,2008:967-980.
[3]STONEBRAKER M,ABADI D J,BATKIN A,et al.C-Store:A column-oriented DBMS.
[C]∥VLDB Conference.Trondheim,Norway:VLDB Endowment,2005:553-564.
[4]BONCZ P,ZUKOWSKI M,NES N.MonetDB/X100:Hyper- Pipelining Query Execution.
[C]∥The Biennial Conf on Innovative Data Systems Research (CIDR).Asilomar,CA,USA:ACM,2005:225-237.
[5]BLANAS S,PATEL J M,ERCEGOVAC V,et al.A comparison of join algorithms for log processing in MapReduce[C]∥The ACM SIGMOD International Conference on Management of Data.Indianapolis,Indiana,USA:ACM,2010:975-986.
[6]ABOUZEID A,BAJDA-PAWLIKOWSKI K,ABADI D J,et al.HadoopDB:An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads[C]∥VLDB Conference.Lyon,France,VLDB Endowment,2009:922-933.
[7]BAJDA-PAWLIKOWSKI K,ABADI D J,SILBERSCHATZ A,et al.Efficient Processing of Data Warehousing Queries[C]∥The ACM SIGMOD International Conference on Management of Data.Athens,Greece,ACM,2011:1165-1176.
[8]CHANG F,DEAN J,GHEMAWAT S,et al.Robert Gruber: Bigtable:A Distributed Storage System for Structured Data[C]∥OSDI.2006:205-218.
[9]MELNIK S,GUBAREV A,LONG J J,et al.Dremel:Interactive Analysis of Web-Scale Datasets[C]∥VLDB Conference.Singapore,VLDB Endowment,2010:330-339.
[10]LIN Y T,AGRAWAL D,CHEN C,et al.Llama:Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework[C]∥The ACM SIGMOD International Conference on Management of Data.Athens,Greece:ACM,2011:961-972.
[11]FLORATOU A,PATEL J M,SHEKITA E J,et al.Column-Oriented Storage Techniques for MapReduce[J].PVLDB,2011,4(7):419-429.
[12]THUSOO A,SARMA J S,JAIN N,et al.Raghotham Murthy:Hive-A Warehousing Solution Over a Map-Reduce Framework.
[C]∥VLDB Conference.Lyon,France,VLDB Endowment,2009:1626-1629.
[13]HE Y Q,LEE R B,HUAI Y,et al.RCFile:A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems[C]∥IEEE International Conference on Data Engineering.Hannover,Germany,2011:1199-1208.
[14]HSIAO H,CHEN M S,YU P S.Parallel execution of hash joins in parallel databases[J].IEEE Trans.on Parallel and Distributed Systems,1997,8(8):872-883.
[15]BONCZ P,MANEGOLD S,KERSTEN M L.Database architecture optimized for the new bottleneck:memory access[C]∥The 25^th Int’l Conf.on Very Large Data Bases.ACM Press,1999:231-246.
[16]O’NEIL P,O’NEIL B,CHEN X D.Star Schema Benchmark Revision[EB/OL].
[2010-2-9].http://www.cs.umb.edu/~poneil.

相关文章 15

[1]	何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[2]	陈晶, 吴玲玲. 多源异构环境下的车联网大数据混合属性特征检测方法 Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment 计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273
[3]	刘卫明, 安冉, 毛伊敏. 基于聚类和WOA的并行支持向量机算法 Parallel Support Vector Machine Algorithm Based on Clustering and WOA 计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040
[4]	陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[5]	王美珊, 姚兰, 高福祥, 徐军灿. 面向医疗集值数据的差分隐私保护技术研究 Study on Differential Privacy Protection for Medical Set-Valued Data 计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032
[6]	孙轩, 王焕骁. 政务大数据安全防护能力建设:基于技术和管理视角的探讨 Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives 计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010
[7]	王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究 Research on Big Data Governance for Science and Technology Forecast 计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207
[8]	余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳. 面向大数据分析的智能交互向导系统 Smart Interactive Guide System for Big Data Analytics 计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083
[9]	王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓. 基于深度学习的民事案件判决结果分类方法研究 Study on Judicial Data Classification Method Based on Natural Language Processing Technologies 计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130
[10]	傅天豪, 田鸿运, 金煜阳, 杨章, 翟季冬, 武林平, 徐小文. 一种面向构件化并行应用程序的性能骨架分析方法 Performance Skeleton Analysis Method Towards Component-based Parallel Applications 计算机科学, 2021, 48(6): 1-9. https://doi.org/10.11896/jsjkx.201200115
[11]	何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵. 基于神威平台的Floyd并行算法的实现和优化 Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform 计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051
[12]	冯凯, 马鑫玉. (n,k)-冒泡排序网络的子网络可靠性 Subnetwork Reliability of (n,k)-bubble-sort Networks 计算机科学, 2021, 48(4): 43-48. https://doi.org/10.11896/jsjkx.201100139
[13]	张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚. 面向MapReduce的中间数据传输流水线优化机制 Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework 计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103
[14]	王雪岑, 张昱, 刘迎婕, 于戈. 基于表示学习的在线学习交互质量评价方法 Evaluation of Quality of Interaction in Online Learning Based on Representation Learning 计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042
[15]	胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立. 基于GPU加速的并行WMD算法 Parallel WMD Algorithm Based on GPU Acceleration 计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed