Computer Science (计算机科学), 2019, Vol. 46, Issue (11A): 208-211.

• Data Science •

Implementation of ETL Scheme Based on Storm Platform

LIANG Kui-kui (梁奎奎)

  1. (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)
  • Online: 2019-11-10 Published: 2019-11-20
  • About the author: LIANG Kui-kui (born 1993), male, master's student. His main research interest is big data applications. E-mail: 1353701931@qq.com.

Abstract: As the Internet continues to develop across many fields, data have become structurally diverse and massive in volume. Faced with this volume, improving the efficiency of ETL is crucial. To address the inconsistent data sources and formats found in “information islands” and the poor real-time performance of data collection, this paper proposes partitioning the ETL workflow vertically and the data set to be processed horizontally, and builds a streaming ETL processing scheme on the Storm platform. In addition, because Storm is insensitive to the CPU load of worker nodes when it assigns tasks, a scheduled task records the CPU load of each worker node, and this information is used to optimize the slot allocation of the Storm scheduler so that the load across the Storm cluster is more balanced. Experimental results show that the scheme effectively improves ETL processing efficiency, and that the slot allocation optimization effectively improves system stability and processing efficiency.

Key words: ETL, Horizontal segmentation, Load optimization, Storm, Vertical segmentation
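
The abstract does not give the details of the optimized slot allocation, so the following is only a minimal sketch of the general idea, under stated assumptions: each worker node reports a recent CPU load sample, and slots are handed out greedily to the node with the lowest projected load. The names `WorkerNode`, `allocate_slots`, and the `load_per_slot` estimate are hypothetical and are not taken from the paper or from the Storm API.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    """A worker node as seen by the hypothetical scheduler."""
    name: str
    cpu_load: float   # most recent CPU load sample, 0.0 to 1.0
    free_slots: int   # slots not yet assigned to any topology


def allocate_slots(nodes, slots_needed, load_per_slot=0.1):
    """Greedily pick slots from the least-loaded nodes.

    load_per_slot is an assumed estimate of the CPU load that one
    extra worker adds to a node; it is not a value from the paper.
    Returns a list of node names, one per allocated slot.
    """
    # Work on copies so the caller's records are not mutated.
    load = {n.name: n.cpu_load for n in nodes}
    free = {n.name: n.free_slots for n in nodes}
    assignment = []
    for _ in range(slots_needed):
        candidates = [name for name in free if free[name] > 0]
        if not candidates:
            break  # the cluster has no free slots left
        # Choose the lowest projected load; break ties by node name.
        best = min(candidates, key=lambda name: (load[name], name))
        assignment.append(best)
        free[best] -= 1
        load[best] += load_per_slot
    return assignment


nodes = [
    WorkerNode("node-a", cpu_load=0.70, free_slots=4),
    WorkerNode("node-b", cpu_load=0.20, free_slots=2),
    WorkerNode("node-c", cpu_load=0.35, free_slots=4),
]
print(allocate_slots(nodes, slots_needed=4))
# prints ['node-b', 'node-b', 'node-c', 'node-c']
```

In this example the heavily loaded `node-a` receives no slots even though it has the most free capacity, which is the behavior a load-insensitive allocation would miss. In a real deployment the load samples would come from the scheduled task described in the abstract, and the selection logic would sit behind Storm's pluggable scheduler interface.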

CLC Number:

  • TP399