Computer Science, 2018, Vol. 45, Issue (11A): 527-531.

• Comprehensive, Interdisciplinary and Applied Topics •

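Design and Implementation of Distributed TensorFlow Platform Based on Kubernetes

YU Chang-fa, CHENG Xue-lin, YANG Xiao-hu

  1. School of Software Technology, Zhejiang University, Hangzhou 310027, China
  • Online: 2019-02-26 Published: 2019-02-26
  • About the authors: YU Chang-fa (born 1992), male, master's degree; main research interests: container clouds and deep learning. E-mail: yu_changfa@126.com. CHENG Xue-lin (born 1966), male, master's degree, senior engineer; main research interest: data visualization analysis. E-mail: cxlin@zju.edu.cn. YANG Xiao-hu (born 1976), male, Ph.D., research fellow; main research interests: software engineering, cloud computing, and financial information technology. E-mail: yangxh@zju.edu.cn.
  • Supported by:
    This work was supported by the Fundamental Research Funds for the Central Universities and by the National Key Technology R&D Program project "Research and Demonstration of Public Cultural Science and Technology Service Capacity Building and Performance Evaluation Technology" (2015BAK26B00).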

Abstract: This paper presents the design and implementation of a distributed TensorFlow platform based on Kubernetes. Distributed TensorFlow suffers from complex environment configuration, uneven distribution of the underlying physical resources, low training efficiency, and long model development cycles. To address these problems, the paper proposes a method for containerizing TensorFlow and uses the Kubernetes container PaaS platform to schedule and manage the TensorFlow containers in a unified way. The design combines the strengths of the two systems: Kubernetes provides a reliable and stable computing environment in which TensorFlow's support for heterogeneous hardware can be fully exploited, which greatly lowers the difficulty of using TensorFlow at large scale. The paper also builds an agile management platform that provides fast resource allocation, one-click deployment, startup within seconds, dynamic scaling, and efficient training for distributed TensorFlow.

Key words: Deep learning, Docker, Kubernetes, TensorFlow

CLC Number:

  • TP311
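
The abstract describes running distributed TensorFlow tasks as containers that Kubernetes schedules. One piece of that setup can be sketched as follows: each container needs to know the network addresses of every parameter server and worker. The sketch below is an illustration under stated assumptions, not the authors' implementation. It assumes one Kubernetes Service per task, the default cluster domain `cluster.local`, and the `TF_CONFIG` JSON layout that TensorFlow reads for distributed training. The job name `mnist`, the namespace `default`, port `2222`, and the helper names `build_cluster_spec` and `tf_config_for` are all hypothetical.

```python
import json


def build_cluster_spec(job, num_ps, num_workers, namespace="default", port=2222):
    """Map each TensorFlow task to the DNS name of a per-task Kubernetes Service.

    Assumption: one Service per task, named <job>-<role>-<index>.
    """
    def hosts(role, count):
        return [
            f"{job}-{role}-{i}.{namespace}.svc.cluster.local:{port}"
            for i in range(count)
        ]

    return {"ps": hosts("ps", num_ps), "worker": hosts("worker", num_workers)}


def tf_config_for(cluster, role, index):
    """Serialize the TF_CONFIG value that one task's container would receive."""
    if role not in cluster or not 0 <= index < len(cluster[role]):
        raise ValueError(f"no task {role}:{index} in cluster")
    return json.dumps({"cluster": cluster, "task": {"type": role, "index": index}})


# Hypothetical job: one parameter server and two workers.
cluster = build_cluster_spec("mnist", num_ps=1, num_workers=2)
print(tf_config_for(cluster, "worker", 1))
```

In such a setup, the platform would inject the resulting string into each container as the `TF_CONFIG` environment variable, so the same image can serve every role and scaling only changes the task counts.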