Computer Science ›› 2019, Vol. 46 ›› Issue (11A): 208-211.

• Data Science •

Implementation of ETL Scheme Based on Storm Platform

LIANG Kui-kui   

  1. (College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)
  • Online: 2019-11-10 Published: 2019-11-20

Abstract: As the Internet continues to expand into many fields, data are becoming structurally diverse and massive in volume. Faced with this flood of data, improving the efficiency of ETL is crucial. To address the inconsistent data sources and formats and the poor real-time data collection found in “information islands”, this paper proposes vertically segmenting the ETL workflow and horizontally segmenting the pending data set, and builds a stream-based ETL processing scheme on the Storm platform. In addition, because Storm does not take the CPU load of worker nodes into account when assigning tasks, a timed task records each worker node's CPU load, and this information is used to optimize the slot allocation of the Storm scheduler so that the load across the Storm cluster is more balanced. Experimental results show that the scheme effectively improves ETL processing efficiency, and that the slot allocation optimization improves both system stability and processing efficiency.
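The abstract does not give the details of the slot allocation algorithm, so the following is only a minimal sketch of the general idea of load-aware placement, not the paper's method and not Storm's scheduler API. It assumes that each worker node's CPU load has already been sampled by a periodic job and that every placed task adds a fixed, assumed amount of load. The names `WorkerNode`, `assign_tasks`, and `load_per_task` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerNode:
    """Hypothetical view of a worker node as seen by a scheduler."""
    name: str
    cpu_load: float   # most recent sampled CPU load, 0.0 to 1.0 (assumed metric)
    free_slots: int   # slots still available on this node
    assigned: list = field(default_factory=list)


def assign_tasks(nodes, tasks, load_per_task=0.1):
    """Greedily place each task on the least-loaded node that has a free slot.

    load_per_task is an assumed, fixed estimate of the CPU load one task adds;
    a real scheduler would refresh the load from measurements instead.
    """
    for task in tasks:
        candidates = [n for n in nodes if n.free_slots > 0]
        if not candidates:
            raise RuntimeError("no free slots left in the cluster")
        target = min(candidates, key=lambda n: n.cpu_load)
        target.assigned.append(task)
        target.free_slots -= 1
        target.cpu_load += load_per_task
    return {n.name: list(n.assigned) for n in nodes}


if __name__ == "__main__":
    cluster = [
        WorkerNode("node-a", cpu_load=0.8, free_slots=2),
        WorkerNode("node-b", cpu_load=0.2, free_slots=2),
        WorkerNode("node-c", cpu_load=0.5, free_slots=2),
    ]
    print(assign_tasks(cluster, ["t1", "t2", "t3", "t4"], load_per_task=0.2))
```

A scheduler that ignores CPU load would treat the three nodes as equivalent. In this sketch the heavily loaded `node-a` receives work only after the lighter nodes have filled their slots or caught up in estimated load.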

Key words: ETL, Horizontal segmentation, Load optimization, Storm, Vertical segmentation

CLC Number: 

  • TP399