Computer Science, 2023, Vol. 50, Issue (6): 86-91. doi: 10.11896/jsjkx.220900110

• High Performance Computing

GPU Shared Scheduling System Under Deep Learning Container Cloud Platform

WANG Zhuang¹, WANG Pinghui¹, WANG Bincheng¹, WU Wenbo¹, WANG Bin², CONG Pengyu²

  1. Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China
  2. China Mobile Research Institute, Beijing 100053, China
  • Received: 2022-09-13 Revised: 2022-12-14 Online: 2023-06-15 Published: 2023-06-06
  • About author: WANG Zhuang, born in 1997, postgraduate, is a student member of China Computer Federation. His main research interests include cloud computing and GPU virtualization. WANG Pinghui, born in 1984, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include mobile Internet security, network graph data mining and knowledge discovery.
  • Supported by:
    National Key R&D Program of China (2021YFB1715600), National Natural Science Foundation of China (61902305, 61922067), Shenzhen Basic Research Grant (JCYJ20170816100819428) and MoE-CMCC “Artificial Intelligence” Project (MCM20190701).

Abstract: In recent years, containers have gradually replaced virtual machines and are now widely used in deep learning cloud platforms because they are lightweight and highly scalable. However, deep learning cloud platforms still fall short in GPU resource management: owing to the limitations of container orchestration technology, multiple containers cannot share GPU resources. For some small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card. The current exclusive mode therefore wastes expensive GPU resources and reduces resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, the system extends the existing cluster functions through the Kubernetes Operator mechanism, enabling multiple Pods to share GPU resources, and it includes an agent mechanism to ensure compatibility with native Kubernetes. On the other hand, based on a GPU time-slice and preemption mechanism, it manages and schedules GPU resources dynamically, coordinates multiple tasks at a fine granularity, and reduces interference between tasks. Experimental results show that, compared with the native Kubernetes scheduling system, the proposed system reduces the completion time of a group of deep learning training tasks by about 20% on average and increases the utilization of cluster GPU resources by about 10% on average. When the GPU is shared, the performance loss of high-priority tasks is less than 5% compared with exclusive use of the GPU, and low-priority tasks can run on the same GPU at 20% of the performance.
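
The time-slice and preemption idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the system described in the paper: the `Task` class, the `schedule` function, the task names, the "larger priority value wins" convention, and the demand patterns are all hypothetical. In each time slice the GPU is granted to the highest-priority task that currently has GPU work, and a lower-priority task only receives the slices the higher-priority task leaves idle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    priority: int          # assumed convention: a larger value wins the slice
    wants_gpu: List[bool]  # wants_gpu[t] is True if the task has GPU work in slice t


def schedule(tasks: List[Task], num_slices: int) -> List[Optional[str]]:
    """Return, for each time slice, the name of the task granted the GPU (or None)."""
    timeline: List[Optional[str]] = []
    for t in range(num_slices):
        # Tasks that have GPU work to do in this slice.
        ready = [task for task in tasks if t < len(task.wants_gpu) and task.wants_gpu[t]]
        # The highest-priority ready task preempts all others for this slice.
        winner = max(ready, key=lambda task: task.priority, default=None)
        timeline.append(winner.name if winner is not None else None)
    return timeline


# Hypothetical workload: the high-priority task leaves every third slice idle
# (for example while loading data); the low-priority task always has work.
high = Task("train-high", priority=10, wants_gpu=[True, True, False, True, True, False])
low = Task("infer-low", priority=1, wants_gpu=[True] * 6)

timeline = schedule([high, low], num_slices=6)
print(timeline)
```

In this toy workload the high-priority task receives every slice it asks for, and the low-priority task runs only in the two slices that would otherwise be idle. The proportions come from the made-up demand pattern and are not meant to reproduce the figures reported in the abstract.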

Key words: Deep learning cloud platform, GPU sharing, Container scheduling, Docker, Kubernetes

CLC Number: TP181