Computer Science, 2024, Vol. 51, Issue (9): 71-79. doi: 10.11896/jsjkx.231000222
DU Yu, YU Zishu, PENG Xiaohui, XU Zhiwei
Abstract: In recent years, large models have achieved remarkable success in fields such as bioinformatics, natural language processing, and computer vision. However, these models demand massive computing resources during both training and inference, making computation costly. At the same time, computing clusters suffer from a supply-demand imbalance: resource utilization is low and task scheduling is difficult. To address this problem, this paper proposes the concept of the padding workload, a workload that performs computation on a cluster's idle resources. A padding workload's computing resources may be preempted by other workloads at any time, but because the resources it uses are of lower priority, their cost is correspondingly low. On this basis, PaddingTorch, a distributed deep learning training framework for padding workloads, is designed. Using trace data from Alibaba's PAI cluster, job scheduling is simulated on 4 GPUs over the 4 GPU time periods with the most frequent task switching, and PaddingTorch is used to train a protein complex prediction program as a padding workload. Training takes 2.8 times as long as with dedicated resources, but the training cost is reduced by 84%, and GPU utilization rises by 25.8% during the periods filled by the padding workload.
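The abstract only summarizes PaddingTorch, so the sketch below is not its actual API; it is a minimal PyTorch illustration, under stated assumptions, of the core mechanism a padding workload needs: the job must tolerate preemption at any moment and resume from its last checkpoint when idle resources reappear. The checkpoint path, the SIGTERM-before-reclaim convention, and the placeholder model and loss are all illustrative assumptions.

# Minimal sketch (not PaddingTorch's actual API) of a preemption-tolerant
# training loop for a padding workload. Assumptions: the scheduler sends
# SIGTERM before reclaiming the GPU, and CKPT_PATH is a hypothetical
# checkpoint location on shared storage.
import os
import signal

import torch
import torch.nn as nn

CKPT_PATH = "padding_ckpt.pt"  # hypothetical checkpoint file
preempted = False

def on_preempt(signum, frame):
    # Record the preemption notice; the loop checkpoints and exits cleanly.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, on_preempt)

model = nn.Linear(128, 10)  # stand-in for the real (e.g. protein) model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

# Resume from the last checkpoint if a previous run was preempted.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"]

for step in range(start_step, 10_000):
    x = torch.randn(32, 128)       # placeholder batch
    loss = model(x).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Checkpoint periodically, and immediately when preemption is signaled,
    # so at most a few steps of work are lost when resources are reclaimed.
    if preempted or step % 100 == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step + 1}, CKPT_PATH)
        if preempted:
            break

In a full framework of this kind, such a loop would presumably be combined with elastic distributed data parallelism, so that the set of participating GPUs can grow or shrink as idle resources appear and are reclaimed; the abstract's 2.8x runtime versus 84% cost reduction reflects exactly this trade of wall-clock time for cheap, preemptible capacity.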