计算机科学 ›› 2018, Vol. 45 ›› Issue (6A): 471-475.
张滨1,乐嘉锦2
ZHANG Bin1,LE Jia-jin2
摘要: 大数据具有规模大、深度大、宽度大、处理时间短、硬件系统普通化、软件系统开源化的特点。传统关系型数据库在对大数据进行操作时存在系统性能严重下降、计算效率提升有限以及可扩展性差等问题,因此引入MapReduce并行计算模型,提出一种大数据上基于列存储的MapReduce分布式Hash连接算法。首先,设计面向大数据的分布式计算模型,在设计的分片聚集并行连接的基础上,利用Hash连接以及动态探测方法优化了数据并行连接处理效率;然后,针对该算法开发了基于Hadoop的原型系统。通过实验证明,在大数据分析处理中,所提算法在执行时间和负载能力上都有很好的性能表现,也能提供良好的可扩展性。
中图分类号:
[1]DEAN J,GHEMAWAT S.MapReduce:Simplified Data Pro- cessing on Large Clusters. [C]∥6th OSDI.San Francisco:USENIX Association,2004:137-150. [2]ABADI D J,MADDEN S R,HACHEM N.Column-Stores vs.Row-Stores:How Different Are They Really? [C]∥The 2008 ACM SIGMOD Int Conf.Vancouver,BC,Canada:ACM,2008:967-980. [3]STONEBRAKER M,ABADI D J,BATKIN A,et al.C-Store:A column-oriented DBMS. [C]∥VLDB Conference.Trondheim,Norway:VLDB Endowment,2005:553-564. [4]BONCZ P,ZUKOWSKI M,NES N.MonetDB/X100:Hyper- Pipelining Query Execution. [C]∥The Biennial Conf on Innovative Data Systems Research (CIDR).Asilomar,CA,USA:ACM,2005:225-237. [5]BLANAS S,PATEL J M,ERCEGOVAC V,et al.A comparison of join algorithms for log processing in MapReduce[C]∥The ACM SIGMOD International Conference on Management of Data.Indianapolis,Indiana,USA:ACM,2010:975-986. [6]ABOUZEID A,BAJDA-PAWLIKOWSKI K,ABADI D J,et al.HadoopDB:An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads[C]∥VLDB Conference.Lyon,France,VLDB Endowment,2009:922-933. [7]BAJDA-PAWLIKOWSKI K,ABADI D J,SILBERSCHATZ A,et al.Efficient Processing of Data Warehousing Queries[C]∥The ACM SIGMOD International Conference on Management of Data.Athens,Greece,ACM,2011:1165-1176. [8]CHANG F,DEAN J,GHEMAWAT S,et al.Robert Gruber: Bigtable:A Distributed Storage System for Structured Data[C]∥OSDI.2006:205-218. [9]MELNIK S,GUBAREV A,LONG J J,et al.Dremel:Interactive Analysis of Web-Scale Datasets[C]∥VLDB Conference.Singapore,VLDB Endowment,2010:330-339. [10]LIN Y T,AGRAWAL D,CHEN C,et al.Llama:Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework[C]∥The ACM SIGMOD International Conference on Management of Data.Athens,Greece:ACM,2011:961-972. [11]FLORATOU A,PATEL J M,SHEKITA E J,et al.Column-Oriented Storage Techniques for MapReduce[J].PVLDB,2011,4(7):419-429. [12]THUSOO A,SARMA J S,JAIN N,et al.Raghotham Murthy:Hive-A Warehousing Solution Over a Map-Reduce Framework. [C]∥VLDB Conference.Lyon,France,VLDB Endowment,2009:1626-1629. [13]HE Y Q,LEE R B,HUAI Y,et al.RCFile:A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems[C]∥IEEE International Conference on Data Engineering.Hannover,Germany,2011:1199-1208. [14]HSIAO H,CHEN M S,YU P S.Parallel execution of hash joins in parallel databases[J].IEEE Trans.on Parallel and Distributed Systems,1997,8(8):872-883. [15]BONCZ P,MANEGOLD S,KERSTEN M L.Database architecture optimized for the new bottleneck:memory access[C]∥The 25th Int’l Conf.on Very Large Data Bases.ACM Press,1999:231-246. [16]O’NEIL P,O’NEIL B,CHEN X D.Star Schema Benchmark Revision[EB/OL]. [2010-2-9].http://www.cs.umb.edu/~poneil. |
[1] | 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240 |
[2] | 陈晶, 吴玲玲. 多源异构环境下的车联网大数据混合属性特征检测方法 Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment 计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273 |
[3] | 刘卫明, 安冉, 毛伊敏. 基于聚类和WOA的并行支持向量机算法 Parallel Support Vector Machine Algorithm Based on Clustering and WOA 计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040 |
[4] | 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157 |
[5] | 王美珊, 姚兰, 高福祥, 徐军灿. 面向医疗集值数据的差分隐私保护技术研究 Study on Differential Privacy Protection for Medical Set-Valued Data 计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032 |
[6] | 孙轩, 王焕骁. 政务大数据安全防护能力建设:基于技术和管理视角的探讨 Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives 计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010 |
[7] | 王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究 Research on Big Data Governance for Science and Technology Forecast 计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207 |
[8] | 余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳. 面向大数据分析的智能交互向导系统 Smart Interactive Guide System for Big Data Analytics 计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083 |
[9] | 王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓. 基于深度学习的民事案件判决结果分类方法研究 Study on Judicial Data Classification Method Based on Natural Language Processing Technologies 计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130 |
[10] | 傅天豪, 田鸿运, 金煜阳, 杨章, 翟季冬, 武林平, 徐小文. 一种面向构件化并行应用程序的性能骨架分析方法 Performance Skeleton Analysis Method Towards Component-based Parallel Applications 计算机科学, 2021, 48(6): 1-9. https://doi.org/10.11896/jsjkx.201200115 |
[11] | 何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵. 基于神威平台的Floyd并行算法的实现和优化 Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform 计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051 |
[12] | 冯凯, 马鑫玉. (n,k)-冒泡排序网络的子网络可靠性 Subnetwork Reliability of (n,k)-bubble-sort Networks 计算机科学, 2021, 48(4): 43-48. https://doi.org/10.11896/jsjkx.201100139 |
[13] | 张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚. 面向MapReduce的中间数据传输流水线优化机制 Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework 计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103 |
[14] | 王雪岑, 张昱, 刘迎婕, 于戈. 基于表示学习的在线学习交互质量评价方法 Evaluation of Quality of Interaction in Online Learning Based on Representation Learning 计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042 |
[15] | 胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立. 基于GPU加速的并行WMD算法 Parallel WMD Algorithm Based on GPU Acceleration 计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213 |
|