Computer Science ›› 2019, Vol. 46 ›› Issue (12): 1-7. doi: 10.11896/jsjkx.190100023

• Big Data & Data Science •

Research on Distributed ETL Tasks Scheduling Strategy Based on ISE Algorithm

WANG Zhuo-hao1, YANG Dong-ju2,3, XU Chen-yang1   

  1. Institute of Scientific and Technical Information of China, Beijing 100038, China
  2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China
  3. Data Engineering Institute, North China University of Technology, Beijing 100144, China
  • Received:2019-01-24 Online:2019-12-15 Published:2019-12-17

Abstract: As data warehouses expand, the number of ETL tasks involved in data integration grows with them, and stand-alone scheduling can no longer handle large numbers of complex ETL tasks. To improve the efficiency of ETL task scheduling, reduce the waiting time of critical tasks, and raise resource utilization, this paper constructs a distributed ETL task scheduling framework consisting of one scheduler and several actuators, which completes ETL task scheduling in three stages: task preprocessing, task scheduling, and task execution. In the task preprocessing stage, a weight model is established for the ETL tasks, and each task's scheduling priority is determined by its weight. In the task scheduling stage, the scheduler restricts the choice of actuator nodes according to the performance and load of each node, and a greedy balance (GB) algorithm distributes the ETL task execution requests so that the load across actuator nodes stays relatively balanced. In the task execution stage, the execution priority of the tasks in each actuator node's queue is determined by the Highest Response Ratio Next (HRRN) algorithm. Experimental results show that the distributed ETL task scheduling framework and the corresponding integrated scheduling execution (ISE) algorithm effectively improve the utilization of cluster resources and shorten the task scheduling and execution time.
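The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's ISE implementation: the abstract gives neither the weight model nor the exact GB rule, so the `weight`, `capacity`, and `service_time` fields and the "lowest projected load per unit of capacity" dispatch rule are assumptions made for illustration. Only the HRRN ratio, (waiting time + service time) / service time, is the standard textbook definition.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    weight: float          # assumed output of the preprocessing weight model
    service_time: float    # assumed estimate of the task's run time
    arrival: float = 0.0   # time the task entered the actuator queue


@dataclass
class Actuator:
    name: str
    capacity: float        # assumed relative performance score of the node
    load: float = 0.0
    queue: list = field(default_factory=list)


def greedy_dispatch(tasks, actuators):
    """Hand out tasks, highest weight first, to the node whose projected
    load per unit of capacity is lowest (an assumed greedy rule)."""
    for task in sorted(tasks, key=lambda t: t.weight, reverse=True):
        node = min(
            actuators,
            key=lambda a: (a.load + task.service_time) / a.capacity,
        )
        node.queue.append(task)
        node.load += task.service_time


def response_ratio(task, now):
    """Standard HRRN ratio: (waiting time + service time) / service time."""
    return (now - task.arrival + task.service_time) / task.service_time


def run_hrrn(queue, start=0.0):
    """Return task names in non-preemptive HRRN execution order."""
    pending = list(queue)
    now = start
    order = []
    while pending:
        ready = [t for t in pending if t.arrival <= now]
        if not ready:
            # Nothing has arrived yet: jump to the next arrival time.
            now = min(t.arrival for t in pending)
            continue
        nxt = max(ready, key=lambda t: response_ratio(t, now))
        pending.remove(nxt)
        now += nxt.service_time
        order.append(nxt.name)
    return order
```

In this sketch the scheduler would call `greedy_dispatch` once per batch of weighted tasks, and each actuator would then order its own queue with `run_hrrn`. HRRN favors short tasks but lets long-waiting tasks overtake them as their ratio grows, which is why it is commonly chosen to limit starvation.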

Key words: Data integration, Distributed clustering, Dynamic allocation, Extraction-Transformation-Loading, Load balancing, Task scheduling

CLC Number: TP301