Computer Science, 2026, Vol. 53, Issue (5): 129-136. doi: 10.11896/jsjkx.250900001
陈远生1, 陈顺珏1, 莫萱1, 吴维刚1, 李嘉伦2
CHEN Yuansheng1, CHEN Shunjue1, MO Xuan1, WU Weigang1, LI Jialun2
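Abstract: Offline tasks are tasks whose processing can be deferred; they have no strict completion-time requirement and typically include batch data processing and machine learning training jobs. With the development of deep learning, training jobs have become one of the core workloads of cloud data centers, and accurately predicting the running time of offline training jobs makes it possible to use the resources left idle by online tasks effectively. However, deep learning models differ in structure and span a wide range of sizes, and training time also depends on factors such as batch size, hyperparameters, and the characteristics of the operators executed. Existing methods cannot account for all of these factors: configuration-based methods ignore the model's internal execution mechanism; operator-based methods ignore the effect of computation-graph structure on training; computation-graph-based methods are highly complex when they use graph neural networks, and lose some dependency relations when the graph is simplified into a topological sequence. To address the shortcomings of topological-sequence methods, this paper proposes the MDOT algorithm, which converts the computation graph into an operator sequence by topological sorting. On this operator sequence, MDOT first uses a Transformer to fuse three dimensions of operator information (operator type, operator configuration, and computational load) into a multi-dimensional operator encoding that models operator execution characteristics more fully. Second, to capture the dependencies in the computation graph, MDOT introduces a graph positional encoding mechanism and uses the Transformer's self-attention to capture relations among operators in the sequence, modeling how operators influence one another's running time. Experimental results show that MDOT outperforms existing methods at predicting the training time of deep learning jobs, with mean absolute error and root mean square error 25% and 45% lower, respectively, than those of the second-best model.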
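The first step named in the abstract, flattening a computation graph into an operator sequence by topological sorting, can be illustrated with a minimal Python sketch. This is not the authors' implementation: it uses Kahn's algorithm as one common way to obtain a topological order, and the function name `topological_operator_sequence`, the operator names, and the toy graph are all hypothetical.

```python
from collections import deque


def topological_operator_sequence(edges, operators):
    """Return the operators in a topological order (Kahn's algorithm).

    edges: list of (producer, consumer) pairs between operator names.
    operators: list of all operator names; its order breaks ties.
    """
    successors = {op: [] for op in operators}
    in_degree = {op: 0 for op in operators}
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Operators with no unprocessed inputs are ready to be emitted.
    ready = deque(op for op in operators if in_degree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in successors[op]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(operators):
        raise ValueError("computation graph contains a cycle")
    return order


# Hypothetical residual block: conv1 feeds both conv2 and the skip-connection add.
operators = ["conv1", "conv2", "add", "relu"]
edges = [("conv1", "conv2"), ("conv1", "add"), ("conv2", "add"), ("add", "relu")]
sequence = topological_operator_sequence(edges, operators)
print(sequence)
```

In the flattened sequence, nothing records that `conv1` also feeds `add` directly through the skip connection. That is the kind of dependency a plain topological sequence loses, and which the abstract says MDOT's graph positional encoding is designed to recover.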