Computer Science ›› 2023, Vol. 50 ›› Issue (6): 86-91. DOI: 10.11896/jsjkx.220900110

• High Performance Computing •


GPU Shared Scheduling System Under Deep Learning Container Cloud Platform

WANG Zhuang1, WANG Pinghui1, WANG Bincheng1, WU Wenbo1, WANG Bin2, CONG Pengyu2   

  1. Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China
    2. China Mobile Research Institute, Beijing 100053, China
  • Received: 2022-09-13 Revised: 2022-12-14 Online: 2023-06-15 Published: 2023-06-06
  • Corresponding author: WANG Pinghui (phwang@mail.xjtu.edu.cn)
  • About author: WANG Zhuang (wangzhuang@stu.xjtu.edu.cn), born in 1997, postgraduate, is a student member of China Computer Federation. His main research interests include cloud computing and GPU virtualization. WANG Pinghui, born in 1984, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include mobile Internet security, network graph data mining and knowledge discovery.
  • Supported by:
    National Key R&D Program of China (2021YFB1715600), National Natural Science Foundation of China (61902305, 61922067), Shenzhen Basic Research Grant (JCYJ20170816100819428) and MoE-CMCC “Artificial Intelligence” Project (MCM20190701).

Abstract: In recent years, containers have gradually replaced virtual machines and are widely used in deep learning cloud platforms due to their light weight and high scalability. However, deep learning cloud platforms still fall short in GPU resource management: because of limitations in container orchestration technology, multiple containers cannot share GPU resources, and for small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card. The current exclusive mode therefore wastes expensive GPU resources and reduces resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, it extends existing cluster functions through the Kubernetes Operator mechanism, enabling multiple Pods to share GPU resources, and designs a proxy mechanism to ensure compatibility with native Kubernetes. On the other hand, based on GPU time slicing and a preemption mechanism, it realizes dynamic management and scheduling of GPU resources, performs fine-grained coordination among multiple tasks, and reduces task interference. Experimental results show that, compared with the native Kubernetes scheduling system, the proposed system reduces the completion time of a group of deep learning training tasks by about 20% on average and increases cluster GPU utilization by about 10% on average. When sharing a GPU, high-priority tasks lose less than 5% of performance relative to exclusive GPU use, while low-priority tasks can run on the same GPU at 20% of their standalone performance.
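To make the sharing model concrete, the sketch below shows how a Pod might request a fraction of a GPU through a Kubernetes extended resource, which is one common way Operator-based sharing schemes expose shareable GPU capacity. This is a hedged illustration in Go, not the paper's actual interface: the resource name example.com/shared-gpu-mem, the Pod name, and the container image are all assumptions.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A Pod that asks for 4 GiB of GPU memory on a shared card instead of a
	// whole device. The stock NVIDIA device plugin only exposes whole GPUs
	// (nvidia.com/gpu), so sharing schemes typically advertise their own
	// finer-grained extended resource, as assumed here.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "small-training-task"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "trainer",
				Image: "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						// Hypothetical extended resource for GPU-memory slices.
						v1.ResourceName("example.com/shared-gpu-mem"): resource.MustParse("4Gi"),
					},
				},
			}},
		},
	}
	fmt.Printf("pod %s requests a GPU-memory slice\n", pod.Name)
}

A custom scheduler (or scheduler extender) can then bin-pack such requests onto physical GPUs, while a node-side proxy maps each container to its assigned device and enforces the quota, which keeps the workflow compatible with native Kubernetes objects.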
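The time-slice and preemption mechanism can likewise be sketched as a round-based allocator. The following minimal Go sketch is a hypothetical illustration of the idea rather than the paper's implementation: each task holds a per-round slice quota, higher-priority tasks are always served first (so they implicitly preempt lower-priority ones), and a low-priority task only consumes slices that high-priority tasks leave unused, which bounds its interference.

package main

import (
	"fmt"
	"sort"
)

// Task models a container workload sharing one GPU.
type Task struct {
	Name     string
	Priority int // higher value is served first
	Quota    int // time slices granted per scheduling round
	used     int // slices consumed in the current round
}

// Scheduler hands out fixed-length GPU time slices round by round.
type Scheduler struct {
	tasks []*Task
}

// NextSlice picks the owner of the upcoming GPU time slice, or nil when
// every task has exhausted its quota for the round.
func (s *Scheduler) NextSlice() *Task {
	// Considering tasks in priority order makes preemption implicit: a
	// high-priority task is always offered a slice before any low-priority one.
	sort.SliceStable(s.tasks, func(i, j int) bool {
		return s.tasks[i].Priority > s.tasks[j].Priority
	})
	for _, t := range s.tasks {
		if t.used < t.Quota {
			t.used++
			return t
		}
	}
	return nil
}

func main() {
	s := &Scheduler{tasks: []*Task{
		{Name: "train-high", Priority: 10, Quota: 8}, // ~80% of the slices
		{Name: "infer-low", Priority: 1, Quota: 2},   // ~20% of the slices
	}}
	for i := 0; i < 10; i++ {
		if t := s.NextSlice(); t != nil {
			fmt.Printf("slice %2d -> %s\n", i, t.Name)
		}
	}
}

With the quotas above, the low-priority task receives about 20% of the slices, mirroring the throughput split reported in the abstract; a real system would additionally measure actual kernel runtimes and adjust quotas between rounds.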

Key words: Deep learning cloud platform, GPU sharing, Container scheduling, Docker, Kubernetes

CLC Number: TP181