Computer Science, 2024, Vol. 51, Issue (9): 71-79. doi: 10.11896/jsjkx.231000222
DU Yu, YU Zishu, PENG Xiaohui, XU Zhiwei
Abstract: In recent years, large models have achieved remarkable success in fields such as bioinformatics, natural language processing, and computer vision. However, these models demand massive computing resources during both training and inference, making computation costly. At the same time, computing clusters suffer from a supply-demand imbalance: resource utilization is low and task scheduling is difficult. To address this problem, this paper proposes the concept of the padding workload, a workload that performs computation on a cluster's idle resources. A padding workload's computing resources may be preempted by other workloads at any time, but because the resources it uses are of lower priority, their cost is correspondingly low. On this basis, PaddingTorch, a distributed deep learning training framework for padding workloads, is designed. Using trace data from Alibaba's PAI cluster, job scheduling is simulated on 4 GPUs over the 4 GPU time periods with the most frequent task switching, and PaddingTorch is used to train a protein complex prediction program as a padding workload. Training takes 2.8 times as long as with dedicated resources, but the training cost is reduced by 84%, and GPU utilization rises by 25.8% during the periods filled by the padding workload.
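The abstract only summarizes PaddingTorch, so the sketch below is not its actual API; it is a minimal PyTorch illustration, under stated assumptions, of the core mechanism a padding workload needs: the job must tolerate preemption at any moment and resume from its last checkpoint when idle resources reappear. The checkpoint path, the SIGTERM-before-reclaim convention, and the placeholder model and loss are all illustrative assumptions.

# Minimal sketch (not PaddingTorch's actual API) of a preemption-tolerant
# training loop for a padding workload. Assumptions: the scheduler sends
# SIGTERM before reclaiming the GPU, and CKPT_PATH is a hypothetical
# checkpoint location on shared storage.
import os
import signal

import torch
import torch.nn as nn

CKPT_PATH = "padding_ckpt.pt"  # hypothetical checkpoint file
preempted = False

def on_preempt(signum, frame):
    # Record the preemption notice; the loop checkpoints and exits cleanly.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, on_preempt)

model = nn.Linear(128, 10)  # stand-in for the real (e.g. protein) model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

# Resume from the last checkpoint if a previous run was preempted.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"]

for step in range(start_step, 10_000):
    x = torch.randn(32, 128)       # placeholder batch
    loss = model(x).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Checkpoint periodically, and immediately when preemption is signaled,
    # so at most a few steps of work are lost when resources are reclaimed.
    if preempted or step % 100 == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step + 1}, CKPT_PATH)
        if preempted:
            break

In a full framework of this kind, such a loop would presumably be combined with elastic distributed data parallelism, so that the set of participating GPUs can grow or shrink as idle resources appear and are reclaimed; the abstract's 2.8x runtime versus 84% cost reduction reflects exactly this trade of wall-clock time for cheap, preemptible capacity.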