Computer Science ›› 2026, Vol. 53 ›› Issue (5): 129-136. doi: 10.11896/jsjkx.250900001

• Database & Big Data & Data Science •

Deep Learning Training Time Prediction Algorithm Integrating Multi-dimensional Operator Features

CHEN Yuansheng1, CHEN Shunjue1, MO Xuan1, WU Weigang1, LI Jialun2

  1 School of Computer Science, Sun Yat-sen University, Guangzhou 510006, China
  2 School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China
  • Received: 2025-09-01 Revised: 2025-11-25 Online: 2026-05-08
  • Corresponding author: LI Jialun (jialun.li@gpnu.edu.cn)
  • About author: CHEN Yuansheng (chenysh253@mail2.sysu.edu.cn), born in 1999, master. His main research interest is cloud computing.
    LI Jialun, born in 1997, lecturer. His main research interests include resource management in cloud datacenters, task scheduling in co-location datacenters, MLaaS and graph neural networks.
  • Supported by:
    This work was supported by the Natural Science Foundation of Guangdong Province (2025A1515011663, 2024A1515010378).

Abstract: Offline tasks are delay-tolerant workloads with no strict completion-time requirements, typically batch data processing or machine learning training jobs. With the development of deep learning, training tasks have become one of the core workloads in cloud data centers, and accurately predicting the runtime of offline training tasks makes it possible to use the resources left idle by online tasks efficiently. However, deep learning models differ widely in architecture and scale, and factors such as batch size, hyperparameters and operator characteristics also affect training time. Existing methods cannot account for all of these factors: configuration-based methods ignore the internal execution mechanism of the model; operator-based methods neglect the impact of the computational graph structure on training; graph-based methods either incur high complexity when they use graph neural networks or lose part of the dependency information when they simplify the graph to a topological sequence. To address the shortcomings of topological sequence methods, this paper proposes the MDOT (Multi-dimensional Operator Transformer) algorithm, which converts the computational graph into an operator sequence by topological sorting. Based on this operator sequence, MDOT first uses a Transformer to fuse three dimensions of operator information (operator type, operator configuration and computational load) into a multi-dimensional operator encoding that models the execution characteristics of operators more comprehensively. Second, to capture the dependencies in the computational graph, MDOT designs a graph position encoding mechanism and uses the Transformer's self-attention to capture the relationships among operators in the sequence, modeling how operators influence one another's running time. Experimental results show that MDOT outperforms existing methods in predicting the training time of deep learning tasks, with mean absolute error and root mean square error 25% and 45% lower, respectively, than those of the second-best model.
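The first step described in the abstract, flattening a computational graph into an operator sequence by topological sorting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `topological_operator_sequence`, the operator names, and the toy residual-style graph are all assumptions made for illustration, and the ordering uses the standard Kahn algorithm.

```python
from collections import deque


def topological_operator_sequence(operators, edges):
    """Return the operators in a topological order (Kahn's algorithm).

    `operators` is a list of operator names and `edges` a list of
    (producer, consumer) pairs; both are hypothetical inputs.
    """
    successors = {op: [] for op in operators}
    in_degree = {op: 0 for op in operators}
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Operators with no unmet dependencies can be emitted immediately.
    ready = deque(op for op in operators if in_degree[op] == 0)
    sequence = []
    while ready:
        op = ready.popleft()
        sequence.append(op)
        for nxt in successors[op]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    if len(sequence) != len(operators):
        raise ValueError("computation graph contains a cycle")
    return sequence


# Hypothetical residual-style block: a chain plus a skip edge conv1 -> add.
operators = ["conv1", "bn1", "relu1", "add"]
edges = [("conv1", "bn1"), ("bn1", "relu1"), ("relu1", "add"), ("conv1", "add")]
sequence = topological_operator_sequence(operators, edges)
print(sequence)
```

In this toy graph the skip edge from `conv1` to `add` cannot be recovered from the flat sequence alone, which is the kind of lost dependency information the abstract says the graph position encoding is meant to restore.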

Key words: Cloud computing, Execution time prediction, Deep learning, Operator, Offline task

CLC Number: TP312