Computer Science, 2019, Vol. 46, Issue (12): 1-7. doi: 10.11896/jsjkx.190100023

• Big Data & Data Science •

Research on Distributed ETL Tasks Scheduling Strategy Based on ISE Algorithm

WANG Zhuo-hao1, YANG Dong-ju2,3, XU Chen-yang1

  1. Institute of Scientific and Technical Information of China, Beijing 100038, China
  2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China
  3. Data Engineering Institute, North China University of Technology, Beijing 100144, China
  • Received: 2019-01-24 Online: 2019-12-15 Published: 2019-12-17
  • Corresponding author: YANG Dong-ju (born 1975), female, Ph.D., associate research fellow. Her main research interests include service computing and data integration and analysis. E-mail: yangdongju@ncut.edu.cn
  • About the authors: WANG Zhuo-hao (born 1977), male, Ph.D., associate research fellow. His main research interests include computer applications, cloud computing, and big data applications. E-mail: wangzh@most.cn. XU Chen-yang (born 1992), male, master's degree. His main research interests include data integration and data analysis.
  • Funding: This work was supported by the Key Program of the National Natural Science Foundation of China (61832004).

Abstract: As data warehouses grow, the number of ETL (Extraction-Transformation-Loading) tasks in data integration grows with them, and single-machine scheduling can no longer cope with the large number of complex ETL tasks. To improve scheduling efficiency, shorten the waiting time of critical tasks, and raise resource utilization, this paper builds a distributed ETL task scheduling framework. The framework consists of a scheduler and several executors, and completes ETL task scheduling in three stages: task preprocessing, task scheduling and allocation, and task execution. In the task preprocessing stage, a weight model is established for the ETL tasks, and the scheduling priority is determined from the weights. In the task scheduling and allocation stage, the scheduler constrains the choice of executor node according to the performance and load of each executor node, and a greedy balance (GB) algorithm distributes the ETL task execution requests so that the load across executor nodes stays relatively balanced. In the task execution stage, the Highest Response Ratio Next (HRRN) algorithm determines the execution priority of the tasks in each executor node's queue. Experimental results show that the distributed ETL task scheduling framework and the corresponding Integrated Scheduling Execution (ISE) algorithm effectively improve the utilization of cluster resources and shorten the execution time of task scheduling.

Key words: Task scheduling, Load balancing, Dynamic allocation, Distributed cluster, ETL (Extraction-Transformation-Loading), Data integration
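
The allocation and execution stages described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's weight model or the details of its GB algorithm, so everything here is an assumption for illustration only: the `Task` and `Executor` classes, the `weight` and `capacity` fields, and the rule "send the next task to the executor with the lowest capacity-normalized load" are hypothetical stand-ins, not the authors' method. The only part taken from standard practice is the HRRN response ratio, (waiting time + service time) / service time, applied non-preemptively.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    weight: float        # hypothetical priority weight from preprocessing
    service_time: float  # estimated run time
    arrival: float = 0.0  # time the task enters the executor queue


@dataclass
class Executor:
    name: str
    capacity: float      # hypothetical relative performance score
    load: float = 0.0    # sum of service times assigned so far
    queue: list = field(default_factory=list)


def greedy_balance(tasks, executors):
    """Assumed greedy rule: take tasks in descending weight order and give
    each one to the executor whose capacity-normalized load would be lowest
    after accepting it."""
    for task in sorted(tasks, key=lambda t: t.weight, reverse=True):
        target = min(
            executors,
            key=lambda e: (e.load + task.service_time) / e.capacity,
        )
        target.queue.append(task)
        target.load += task.service_time
    return executors


def response_ratio(task, now):
    """Standard HRRN ratio: (waiting time + service time) / service time."""
    wait = max(0.0, now - task.arrival)
    return (wait + task.service_time) / task.service_time


def hrrn_order(queue, start=0.0):
    """Return task names in non-preemptive HRRN execution order."""
    pending = list(queue)
    now = start
    order = []
    while pending:
        ready = [t for t in pending if t.arrival <= now]
        if not ready:
            # Nothing has arrived yet: jump to the next arrival time.
            now = min(t.arrival for t in pending)
            continue
        nxt = max(ready, key=lambda t: response_ratio(t, now))
        pending.remove(nxt)
        order.append(nxt.name)
        now += nxt.service_time
    return order
```

In this sketch a short task that has waited a long time overtakes a longer task that arrived earlier, which is how HRRN limits starvation in a queue. The normalized-load rule lets a faster executor absorb more work before it counts as equally loaded.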

CLC Number: TP301