Computer Science, 2023, Vol. 50, Issue (6): 86-91. doi: 10.11896/jsjkx.220900110
王壮1, 王平辉1, 王彬丞1, 武文博1, 王斌2, 丛鹏宇2
WANG Zhuang1, WANG Pinghui1, WANG Bincheng1, WU Wenbo1, WANG Bin2, CONG Pengyu2
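Abstract: In recent years, containers have gradually replaced virtual machines and become widely used in deep learning cloud platforms, owing to their light weight and high scalability. However, GPU resource management on these platforms still falls short. Because of limitations in container orchestration technology, multiple containers cannot share a GPU, yet a single training or inference task for a small-scale model cannot fully use the compute resources of an entire GPU card. The current exclusive mode therefore wastes expensive GPU resources and lowers both resource efficiency and service availability. To address this problem, a GPU sharing scheduling system is proposed. On the one hand, the system extends existing cluster functionality through the Kubernetes Operator mechanism so that multiple Pods can share GPU resources, and an agent mechanism is designed to preserve compatibility with native Kubernetes. On the other hand, based on GPU time slices and a preemption mechanism, the system manages and schedules GPU resources dynamically, coordinates multiple tasks at fine granularity, and reduces interference between tasks. Experimental results show that, compared with the native Kubernetes scheduling system, the proposed system reduces the completion time of a set of deep learning training tasks by about 20% on average and raises cluster GPU utilization by about 10% on average. When a GPU is shared, the performance of a high-priority task degrades by less than 5% relative to exclusive use of the GPU, while a low-priority task can run on the same GPU at 20% performance.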
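The abstract does not give the scheduling algorithm itself, so the following is only a minimal sketch of the general idea of time-slice scheduling with priority preemption, not the authors' implementation. The `Task` fields, the function name `run_time_slices`, and the convention that a larger `priority` value means a more important task are all assumptions made for illustration. In each time slice the GPU is granted to the highest-priority runnable task, so a low-priority task is preempted as soon as a high-priority task arrives and resumes once that task finishes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int  # assumption: larger value = more important
    arrival: int   # index of the first time slice in which the task is runnable
    work: int      # number of GPU time slices the task needs in total


def run_time_slices(tasks):
    """Simulate one GPU shared by time slices with priority preemption.

    Returns (timeline, finish): timeline[i] is the name of the task that
    held the GPU during slice i (None if the GPU was idle), and finish maps
    each task name to the slice count at which it completed.
    """
    remaining = {task.name: task.work for task in tasks}
    timeline = []
    finish = {}
    t = 0
    while any(remaining.values()):
        # Tasks that have arrived and still have work left.
        ready = [task for task in tasks
                 if task.arrival <= t and remaining[task.name] > 0]
        if ready:
            # The highest-priority ready task takes this slice; any
            # lower-priority task that was running is implicitly preempted.
            current = max(ready, key=lambda task: task.priority)
            remaining[current.name] -= 1
            timeline.append(current.name)
            if remaining[current.name] == 0:
                finish[current.name] = t + 1
        else:
            timeline.append(None)  # no runnable task: the GPU idles
        t += 1
    return timeline, finish


if __name__ == "__main__":
    demo = [Task("low", priority=1, arrival=0, work=4),
            Task("high", priority=10, arrival=2, work=3)]
    timeline, finish = run_time_slices(demo)
    print(timeline)
    print(finish)
```

In this toy run the low-priority task uses the GPU until the high-priority task arrives, is preempted for three slices, and then resumes. A real system would also have to account for context-switch cost and GPU memory sharing, which this sketch ignores.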