Computer Science ›› 2024, Vol. 51 ›› Issue (9): 71-79. doi: 10.11896/jsjkx.231000222

• High Performance Computing •

Padding Load: A Load that Reduces Cluster Resource Waste and Deep Learning Training Costs

DU Yu, YU Zishu, PENG Xiaohui, XU Zhiwei   

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-10-31 Revised: 2024-04-18 Online: 2024-09-15 Published: 2024-09-10
  • About author: DU Yu, born in 2001, postgraduate. Her main research interests include distributed systems and deep learning frameworks.
    YU Zishu, born in 1996, Ph.D. candidate. His main research interests include distributed systems and runtime management.
  • Supported by:
    Natural Science Foundation of Beijing, China (4212027) and National Natural Science Foundation of China (62072434).

Abstract: In recent years, large-scale models have achieved remarkable success in domains such as bioinformatics, natural language processing, and computer vision. However, these models often require substantial computational resources during training and inference, incurring considerable computational costs. Meanwhile, computing clusters suffer from imbalances between supply and demand, manifested as low resource utilization and difficulty in task scheduling. To address this problem, the concept of Padding Load is introduced, which leverages idle computing resources for computational tasks. Resources allocated to a Padding Load can be preempted by other tasks at any time; because such a load runs at a lower resource priority, it is charged at a correspondingly lower cost. PaddingTorch is a distributed deep learning training framework tailored for Padding Load. Using trace data from the Alibaba PAI cluster, job scheduling is simulated on four GPUs during peak task-switching intervals, and PaddingTorch is used to train a protein complex prediction model as a Padding Load. While the training duration is 2.8 times that of exclusive resource usage, training costs are reduced by 84%, and GPU resource utilization increases by 25.8% during the periods when Padding Load is employed.
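The key requirement behind Padding Load is that the training job must tolerate losing its GPUs at any moment, which in practice means checkpointing and resuming rather than restarting from scratch. Below is a minimal sketch, in plain PyTorch, of what such a preemption-aware training loop could look like; the SIGTERM-based revocation notice, the checkpoint path, and all names here are illustrative assumptions, not the actual PaddingTorch API.

```python
# Hypothetical sketch of a preemption-aware training loop for a
# Padding Load-style job. All names (CKPT_PATH, on_preempt, the
# SIGTERM convention) are assumptions for illustration only.
import os
import signal

import torch
import torch.nn as nn

CKPT_PATH = "padding_load_ckpt.pt"  # assumed checkpoint location
preempted = False

def on_preempt(signum, frame):
    # The cluster revokes Padding Load resources by signalling the job;
    # we only set a flag so the step in flight can finish cleanly.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, on_preempt)

model = nn.Linear(512, 512)          # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

if os.path.exists(CKPT_PATH):        # resume after an earlier preemption
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    start_step = ckpt["step"]

for step in range(start_step, 10_000):
    x = torch.randn(32, 512)         # placeholder batch
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if preempted or step % 100 == 0: # checkpoint periodically and on revocation
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step + 1}, CKPT_PATH)
        if preempted:
            break                    # yield the GPU to the higher-priority task
```

As a side note on the reported figures: since cost is duration times unit price, a run that takes 2.8 times as long yet costs 84% less implies the Padding Load unit price is roughly 0.16/2.8 ≈ 5.7% of the exclusive-use price.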

Key words: Deep learning, Distributed training, Resource utilization, Computing cluster, Programming framework

CLC Number: TP312

[1]JUMPER J,EVANS R,PRITZEL A,et al.Highly accurate protein structure prediction with AlphaFold[J].Nature,2021,596(7873):583-589.
[2]BATEMAN A,MARTIN M J,ORCHARD S,et al.UniProt:the Universal Protein Knowledgebase in 2023[J].Nucleic Acids Research,2023,51(D1):D523-D531.
[3]DISKIN M,BUKHTIYAROV A,RYABININ M,et al.Distributed deep learning in open collaborations[J].Advances in Neural Information Processing Systems,2021,34:7879-7897.
[4]WENG Q,YANG L,YU Y,et al.Beware of Fragmentation:Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent[C]//2023 USENIX Annual Technical Conference(USENIX ATC 23).2023.
[5]WENG Q,XIAO W,YU Y,et al.MLaaS in the wild:Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters[C]//19th USENIX Symposium on Networked Systems Design and Implementation(NSDI 22).USENIX Association,2022:945-960.
[6]JIA Z,MAGGIONI M,STAIGER B,et al.Dissecting the NVIDIA Volta GPU architecture via microbenchmarking[J].arXiv:1804.06826,2018.
[7]NVIDIA[EB/OL].[2023-06-29].https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf.
[8]YEUNG G,BOROWIEC D,FRIDAY A,et al.Towards GPU utilization prediction for cloud deep learning[C]//Proceedings of the 12th USENIX Conference on Hot Topics in Cloud Computing.2020:6-6.
[9]HU Q,SUN P,YAN S,et al.Characterization and prediction of deep learning workloads in large-scale GPU datacenters[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis.2021:1-15.
[10]LI J,XU H,ZHU Y,et al.Lyra:Elastic scheduling for deep learning clusters[C]//Proceedings of the Eighteenth European Conference on Computer Systems.2023:835-850.
[11]XIAO W,REN S,LI Y,et al.AntMan:Dynamic Scaling on GPU Clusters for Deep Learning[C]//OSDI.2020:533-548.
[12]Amazon Web Service(AWS)- Cloud Computing Services[EB/OL].[2023-04-15].https://aws.amazon.com/.
[13]Alibaba Cloud[EB/OL].[2023-04-15].https://www.alibabacloud.com/.
[14]Tencent Cloud[EB/OL].[2023-04-15].https://cloud.tencent.com/.
[15]SERGEEV A,DEL BALSO M.Horovod:fast and easy distributed deep learning in TensorFlow[J].arXiv:1802.05799,2018.
[16]PyTorch Lightning[EB/OL].[2023-04-15].https://lightning.ai/docs/pytorch/latest/.
[17]PRZYBYLSKI B,PAWLIK M,ŻUK P,et al.Using unused:non-invasive dynamic FaaS infrastructure with HPC-whisk[C]//SC22:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE,2022:1-15.
[18]GOYAL P,DOLLAR P,GIRSHICK R B,et al.Accurate,Large Minibatch SGD:Training ImageNet in 1 Hour[J].arXiv:1706.02677,2017.
[19]Alibaba Cluster Trace Program[EB/OL].[2023-04-15].https://github.com/alibaba/clusterdata.
[20]WANG K J,JIA T,LI Y.State-of-the-art Survey of Scheduling and Resource Management Technology for Colocation Jobs[J].Journal of Software,2020,31(10):3100-3119.
[21]ROMERO F,DELIMITROU C.Mage:Online and interference-aware scheduling for multi-scale heterogeneous systems[C]//Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques.2018:1-13.