Computer Science ›› 2023, Vol. 50 ›› Issue (6): 86-91. DOI: 10.11896/jsjkx.220900110

• High Performance Computing •


GPU Shared Scheduling System Under Deep Learning Container Cloud Platform

WANG Zhuang1, WANG Pinghui1, WANG Bincheng1, WU Wenbo1, WANG Bin2, CONG Pengyu2   

  1. Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China
    2. China Mobile Research Institute, Beijing 100053, China
  • Received: 2022-09-13 Revised: 2022-12-14 Online: 2023-06-15 Published: 2023-06-06
  • Corresponding author: WANG Pinghui (phwang@mail.xjtu.edu.cn)
  • About author: WANG Zhuang (wangzhuang@stu.xjtu.edu.cn), born in 1997, postgraduate, is a student member of China Computer Federation. His main research interests include cloud computing and GPU virtualization. WANG Pinghui, born in 1984, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include mobile Internet security, network graph data mining and knowledge discovery.
  • Supported by:
    National Key R&D Program of China (2021YFB1715600), National Natural Science Foundation of China (61902305, 61922067), Shenzhen Basic Research Grant (JCYJ20170816100819428) and MoE-CMCC “Artificial Intelligence” Project (MCM20190701).

Abstract: In recent years, containers have gradually replaced virtual machines and are widely used in deep learning cloud platforms due to their light weight and high scalability. However, deep learning cloud platforms still fall short in GPU resource management: because of limitations in container orchestration technology, multiple containers cannot share GPU resources, and for small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card. The current exclusive mode therefore wastes expensive GPU resources and reduces resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, it extends existing cluster functions through the Kubernetes Operator mechanism, enabling multiple Pods to share GPU resources, and designs a proxy mechanism to ensure compatibility with native Kubernetes. On the other hand, based on GPU time slicing and a preemption mechanism, it realizes dynamic management and scheduling of GPU resources, performs fine-grained coordination among multiple tasks, and reduces task interference. Experimental results show that, compared with the native Kubernetes scheduling system, the proposed system reduces the completion time of a group of deep learning training tasks by about 20% on average and increases cluster GPU utilization by about 10% on average. When sharing a GPU, high-priority tasks lose less than 5% of performance relative to exclusive GPU use, while low-priority tasks can run on the same GPU at 20% of their standalone performance.
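To make the sharing model concrete, the sketch below shows how a Pod might request a fraction of a GPU through a Kubernetes extended resource, which is one common way Operator-based sharing schemes expose shareable GPU capacity. This is a hedged illustration in Go, not the paper's actual interface: the resource name example.com/shared-gpu-mem, the Pod name, and the container image are all assumptions.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A Pod that asks for 4 GiB of GPU memory on a shared card instead of a
	// whole device. The stock NVIDIA device plugin only exposes whole GPUs
	// (nvidia.com/gpu), so sharing schemes typically advertise their own
	// finer-grained extended resource, as assumed here.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "small-training-task"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "trainer",
				Image: "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						// Hypothetical extended resource for GPU-memory slices.
						v1.ResourceName("example.com/shared-gpu-mem"): resource.MustParse("4Gi"),
					},
				},
			}},
		},
	}
	fmt.Printf("pod %s requests a GPU-memory slice\n", pod.Name)
}

A custom scheduler (or scheduler extender) can then bin-pack such requests onto physical GPUs, while a node-side proxy maps each container to its assigned device and enforces the quota, which keeps the workflow compatible with native Kubernetes objects.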
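The time-slice and preemption mechanism can likewise be sketched as a round-based allocator. The following minimal Go sketch is a hypothetical illustration of the idea rather than the paper's implementation: each task holds a per-round slice quota, higher-priority tasks are always served first (so they implicitly preempt lower-priority ones), and a low-priority task only consumes slices that high-priority tasks leave unused, which bounds its interference.

package main

import (
	"fmt"
	"sort"
)

// Task models a container workload sharing one GPU.
type Task struct {
	Name     string
	Priority int // higher value is served first
	Quota    int // time slices granted per scheduling round
	used     int // slices consumed in the current round
}

// Scheduler hands out fixed-length GPU time slices round by round.
type Scheduler struct {
	tasks []*Task
}

// NextSlice picks the owner of the upcoming GPU time slice, or nil when
// every task has exhausted its quota for the round.
func (s *Scheduler) NextSlice() *Task {
	// Considering tasks in priority order makes preemption implicit: a
	// high-priority task is always offered a slice before any low-priority one.
	sort.SliceStable(s.tasks, func(i, j int) bool {
		return s.tasks[i].Priority > s.tasks[j].Priority
	})
	for _, t := range s.tasks {
		if t.used < t.Quota {
			t.used++
			return t
		}
	}
	return nil
}

func main() {
	s := &Scheduler{tasks: []*Task{
		{Name: "train-high", Priority: 10, Quota: 8}, // ~80% of the slices
		{Name: "infer-low", Priority: 1, Quota: 2},   // ~20% of the slices
	}}
	for i := 0; i < 10; i++ {
		if t := s.NextSlice(); t != nil {
			fmt.Printf("slice %2d -> %s\n", i, t.Name)
		}
	}
}

With the quotas above, the low-priority task receives about 20% of the slices, mirroring the throughput split reported in the abstract; a real system would additionally measure actual kernel runtimes and adjust quotas between rounds.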

Key words: Deep learning cloud platform, GPU sharing, Container scheduling, Docker, Kubernetes

CLC Number: TP181