Computer Science, 2020, Vol. 47, Issue 7: 220-226. doi: 10.11896/jsjkx.200300097

Special topic: Network and Communication

• Computer Networks •

  • Corresponding author: JIANG Wen-bin (wenbinjiang@hust.edu.cn)

4Bit-based Gradient Compression Method for Distributed Deep Learning System

JIANG Wen-bin, FU Zhi, PENG Jing, ZHU Jian   

  1. National Engineering Research Center for Big Data Technology and System, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  • Received: 2020-03-16 Online: 2020-07-15 Published: 2020-07-16
  • About author: JIANG Wen-bin, born in 1975, Ph.D, professor, is a member of China Computer Federation. His main research interests include distributed computing and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61672250).



Abstract: Compressing gradient data before transmission is an effective way to reduce the communication overhead of a distributed deep learning system; the 2Bit method in MXNet is one example. However, such methods share a prominent problem: an excessively high compression ratio causes accuracy and convergence speed to decline, especially for larger network models. To address this problem, a new gradient compression strategy called 4Bit is proposed, in which four bits represent one gradient value (normally a 32-bit floating-point number). Compared with 2Bit, this method approximates gradients at a finer granularity, improving both the accuracy of the training results and their convergence. Furthermore, different approximation thresholds are selected according to the gradient characteristics of each layer of the network model, which makes the compressed values more reasonable and further improves the convergence speed and final accuracy of the model; concretely, balancing convenience of operation against reasonableness of the distribution, three groups of thresholds are set to match the differing gradient characteristics of different layers. The experimental results show that, although the 4Bit strategy with multiple threshold groups is slightly inferior to the 2Bit method in terms of acceleration, it achieves higher accuracy and better practicability, reducing the communication overhead of a distributed deep learning system while maintaining higher model accuracy. This is very meaningful for obtaining better-performing deep learning models in resource-constrained environments.
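The abstract describes the scheme only at a high level; a minimal sketch of the underlying idea — mapping each 32-bit gradient to one of 16 levels derived from a per-layer threshold — might look as follows. The level placement, the `make_levels`/`compress`/`decompress` helpers, and the single-threshold-per-layer simplification are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def make_levels(threshold: float, n_bits: int = 4) -> np.ndarray:
    """Build 2**n_bits symmetric quantization levels in [-threshold, threshold]."""
    return np.linspace(-threshold, threshold, 2 ** n_bits, dtype=np.float32)

def compress(grad: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Map each gradient to the index (0..15) of its nearest level: 4 bits of payload per value."""
    idx = np.abs(grad[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8)

def decompress(idx: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate gradient from the 4-bit codes."""
    return levels[idx]

# A layer's gradients and a per-layer threshold (values are illustrative).
grad = np.array([-0.9, -0.05, 0.0, 0.3, 1.2], dtype=np.float32)
levels = make_levels(threshold=1.0)
codes = compress(grad, levels)        # what would be sent over the network
approx = decompress(codes, levels)    # what the receiver reconstructs
```

With 16 levels, the reconstruction error for gradients inside the threshold range is bounded by half the level spacing; values outside the range saturate to ±threshold, which is why choosing thresholds per layer (and per group of layers) matters for accuracy.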

Key words: Deep learning, Distributed training, Gradient compression strategy

CLC Number: TP183