Computer Science ›› 2019, Vol. 46 ›› Issue (12): 1-7. doi: 10.11896/jsjkx.190100023

• Big Data & Data Science •

Research on Distributed ETL Tasks Scheduling Strategy Based on ISE Algorithm

WANG Zhuo-hao1, YANG Dong-ju2,3, XU Chen-yang1

  1. (Institute of Scientific and Technical Information of China, Beijing 100038, China)1;
    (Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China)2;
    (Data Engineering Institute, North China University of Technology, Beijing 100144, China)3
  • Received: 2019-01-24 Online: 2019-12-15 Published: 2019-12-17
  • Corresponding author: YANG Dong-ju (born 1975), female, Ph.D., associate researcher. Her main research interests include service computing and data integration and analysis. E-mail: yangdongju@ncut.edu.cn.
  • About the authors: WANG Zhuo-hao (born 1977), male, Ph.D., associate researcher. His main research interests include computer applications, cloud computing, and big data applications. E-mail: wangzh@most.cn. XU Chen-yang (born 1992), male, master's degree. His main research interests include data integration and data analysis.
  • Supported by:
    This work was supported by the Key Program of the National Natural Science Foundation of China (61832004).

Abstract: As data warehouses grow, the number of ETL (Extraction-Transformation-Loading) tasks in data integration grows with them, and stand-alone scheduling can no longer cope with the volume and complexity of these tasks. To improve scheduling efficiency, shorten the waiting time of critical tasks, and raise resource utilization, this paper constructs a distributed ETL task scheduling framework consisting of a scheduler and several executors. The framework completes ETL task scheduling in three stages: task preprocessing, task scheduling and allocation, and task execution. In the task preprocessing stage, a weight model is established for the ETL tasks, and the scheduling priority is determined according to the weights. In the task scheduling and allocation stage, the scheduler constrains the choice of executor nodes according to the performance and load of each node, and a Greedy Balance (GB) algorithm is designed to distribute ETL task execution requests so that the load across executor nodes is relatively balanced. In the task execution stage, the Highest Response Ratio Next (HRRN) algorithm determines the execution priority of the tasks in each executor node's queue. Experimental results show that the distributed ETL task scheduling framework and the corresponding Integrated Scheduling Execution (ISE) algorithm can effectively improve the utilization of cluster resources and shorten the execution time of task scheduling.

Key words: Data integration, Distributed cluster, Dynamic allocation, Extraction-Transformation-Loading, Load balancing, Task scheduling
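
The allocation and execution stages described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the weight model or the details of the GB algorithm. The sketch therefore assumes that tasks arrive with a precomputed weight, and that the greedy step picks the executor with the lowest capacity-normalised load. The names `Task`, `Executor`, `capacity`, `greedy_balance`, and `hrrn_next` are hypothetical. Only the HRRN response ratio, (waiting time + service time) / service time, is the standard textbook definition.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    weight: float         # hypothetical priority weight from the preprocessing stage
    est_runtime: float    # estimated service time, in seconds
    arrival: float = 0.0  # time at which the task entered an executor queue


@dataclass
class Executor:
    name: str
    capacity: float       # hypothetical relative performance score of the node
    load: float = 0.0     # total estimated runtime already assigned
    queue: list = field(default_factory=list)


def greedy_balance(tasks, executors):
    """Assign tasks, highest weight first, to the executor whose
    capacity-normalised load would be lowest after taking the task.
    This selection rule is an assumption, not the paper's GB algorithm."""
    for task in sorted(tasks, key=lambda t: t.weight, reverse=True):
        target = min(
            executors,
            key=lambda e: (e.load + task.est_runtime) / e.capacity,
        )
        target.queue.append(task)
        target.load += task.est_runtime
    return executors


def response_ratio(task, now):
    """Standard HRRN ratio: (waiting time + service time) / service time."""
    waiting = now - task.arrival
    return (waiting + task.est_runtime) / task.est_runtime


def hrrn_next(queue, now):
    """Remove and return the queued task with the highest response ratio."""
    best = max(queue, key=lambda t: response_ratio(t, now))
    queue.remove(best)
    return best
```

Under these assumptions, short tasks that have waited a long time overtake long tasks in an executor's queue, while the greedy step keeps faster nodes busier than slower ones. A production scheduler would also need live load metrics and failure handling, which this sketch omits.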

CLC Number:

  • TP301