Computer Science ›› 2020, Vol. 47 ›› Issue (7): 220-226. doi: 10.11896/jsjkx.200300097

• Computer Network •

4Bit-based Gradient Compression Method for Distributed Deep Learning System

JIANG Wen-bin, FU Zhi, PENG Jing, ZHU Jian   

  1. National Engineering Research Center for Big Data Technology and System,School of Computer Science and Technology,Huazhong University of Science and Technology,Wuhan 430074,China
  • Received:2020-03-16 Online:2020-07-15 Published:2020-07-16
  • Corresponding author:JIANG Wen-bin (wenbinjiang@hust.edu.cn)
  • About author:JIANG Wen-bin,born in 1975,Ph.D,professor,is a member of China Computer Federation.His main research interests include distributed computing and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61672250)

Abstract: Compressing gradient data before transmission is an effective way to reduce the communication overhead among the machines of a distributed deep learning system; the 2Bit method in the MXNet system is a typical example. However, such methods share a prominent problem: an excessively high compression ratio causes accuracy and convergence speed to decline, especially for large deep neural network models. To address this problem, a new gradient compression strategy called 4Bit is proposed, which uses four bits to represent one gradient value (normally a 32-bit floating-point number). Compared with 2Bit, this method approximates gradient values at a finer granularity, thus improving the accuracy and convergence of training. Furthermore, different approximation thresholds are selected according to the gradient characteristics of each layer of the network model, which makes the compressed values more reasonable and further accelerates convergence and raises the final accuracy. Specifically, balancing ease of operation against the reasonableness of the value distribution, three groups of thresholds are set according to the per-layer gradient characteristics, matching the differing gradient behavior of different layers. Experimental results show that, although the multi-threshold 4Bit strategy is slightly inferior to the 2Bit method in terms of acceleration, it achieves higher accuracy and is more practical. It reduces the communication overhead of a distributed deep learning system while maintaining higher model accuracy, which is significant for obtaining better-performing deep learning models in resource-constrained environments.
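
To make the scheme concrete, the following Python sketch illustrates 4-bit gradient quantization with a per-layer threshold and error feedback. The function names, the symmetric 15-level code placement, and the threshold heuristic are illustrative assumptions based on this abstract, not the authors' implementation; a real system would also pack two 4-bit codes into each transmitted byte.

    import numpy as np

    def quantize_4bit(grad, threshold, residual):
        # Fold the rounding error of the previous step back into the
        # gradient so that small values are not lost permanently.
        corrected = grad + residual
        # Four bits cover 7 positive levels, 7 negative levels and zero.
        step = threshold / 7.0
        codes = np.clip(np.rint(corrected / step), -7, 7).astype(np.int8)
        dequantized = codes.astype(np.float32) * step
        # Remember the new rounding error for the next iteration.
        residual[:] = corrected - dequantized
        return codes, dequantized

    def layer_threshold(grad, base=3.0):
        # Hypothetical per-layer threshold: the paper groups layers into
        # three threshold sets by their gradient statistics; scaling a
        # base value by the mean absolute gradient is a simple stand-in.
        return base * float(np.abs(grad).mean())

    # Example: compress one layer's gradients.
    g = np.random.randn(1024).astype(np.float32) * 0.01
    r = np.zeros_like(g)
    codes, approx = quantize_4bit(g, layer_threshold(g), r)

Only the codes (4 bits each after packing) and the per-layer threshold need to cross the network; the receiver reconstructs each gradient as codes * (threshold / 7).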

Key words: Deep learning, Gradient compression strategy, Distributed training

CLC number: TP183

References
[1] LECUN Y,BENGIO Y,HINTON G.Deep Learning[J].Nature,2015,521(7553):436-444.
[2] YIN B,WANG W,WANG L.Review of Deep Learning[J].Journal of Beijing University of Technology,2015,41(1):48-59.
[3] HINTON G,DENG L,YU D,et al.Deep Neural Networks for Acoustic Modeling in Speech Recognition:The Shared Views of Four Research Groups[J].IEEE Signal Processing Magazine,2012,29(6):82-97.
[4] GRAVES A,MOHAMED A,HINTON G.Speech Recognition with Deep Recurrent Neural Networks[C]//International Conference on Acoustics,Speech and Signal Processing.USA:IEEE,2013:6645-6649.
[5] DAI Y L,HE L,HUANG Z C.Unsupervised image hashing algorithm based on sparse-autoencoder[J].Computer Engineering,2019,45(5):222-225,236.
[6] FARABET C,COUPRIE C,NAJMAN L,et al.Learning Hierarchical Features for Scene Labeling[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(8):1915-1929.
[7] SUTSKEVER I,VINYALS O,LE Q.Sequence to Sequence Learning with Neural Networks[C]//Advances in Neural Information Processing Systems 27.USA:MIT Press,2014:3104-3112.
[8] COLLOBERT R,WESTON J,BOTTOU L,et al.Natural Language Processing (Almost) from Scratch[J].Journal of Machine Learning Research,2011,12(8):2493-2537.
[9] YU K,JIA L,CHEN Y,et al.Deep Learning:Yesterday,Today,and Tomorrow[J].Journal of Computer Research and Development,2013,50(9):1799-1804.
[10] CHE S,BOYER M,MENG J,et al.A Performance Study of General-purpose Applications on Graphics Processors Using CUDA[J].Journal of Parallel and Distributed Computing,2008,68(10):1370-1380.
[11] HUILGOL R.2bit Gradient Compression [EB/OL].https://github.com/apache/incubator-mxnet/pull/8662.
[12] DEAN J,CORRADO G,MONGA R,et al.Large Scale Distributed Deep Networks[C]//Advances in Neural Information Processing Systems 25.USA:Curran Associates Inc,2012:1223-1231.
[13] REN Y,WU X,LI Z,et al.iRDMA:Efficient Use of RDMA in Distributed Deep Learning Systems[C]//Proceedings of the 2017 IEEE 19th International Conference on High Performance Computing and Communications.USA:IEEE,2017:231-238.
[14] ZHANG H,ZHENG Z,XU S,et al.Poseidon:An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters[C]//Proceedings of the 2017 USENIX Annual Technical Conference.USA:USENIX Association,2017:181-193.
[15] WEN W,XU C,YAN F,et al.TernGrad:Ternary Gradients to Reduce Communication in Distributed Deep Learning[C]//Advances in Neural Information Processing Systems 30.USA:Curran Associates Inc,2017:1508-1518.
[16] IOFFE S,SZEGEDY C.Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shift[J].ArXiv:1502.03167,2015.
[17] KRIZHEVSKY A,HINTON G.Learning Multiple Layers of Features from Tiny Images[R].Toronto:University of Toronto,2009.
[18] ZHAO L,WANG J,LI X,et al.On the Connection of Deep Fusion to Ensembling[J].ArXiv:1611.07718,2016.
[19] RUSSAKOVSKY O,DENG J,SU H,et al.ImageNet Large Scale Visual Recognition Challenge[J].International Journal of Computer Vision,2015,115(3):211-252.