Computer Science ›› 2020, Vol. 47 ›› Issue (7): 220-226.doi: 10.11896/jsjkx.200300097

• Computer Network •

4Bit-based Gradient Compression Method for Distributed Deep Learning System

JIANG Wen-bin, FU Zhi, PENG Jing, ZHU Jian   

  1. National Engineering Research Center for Big Data Technology and System,School of Computer Science and Technology,Huazhong University of Science and Technology,Wuhan 430074,China
  • Received:2020-03-16 Online:2020-07-15 Published:2020-07-16
  • About author:JIANG Wen-bin,born in 1975,Ph.D,professor,is a member of China Computer Federation.His main research interests include distributed computing and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61672250).

Abstract: Compressing gradient data before transmission is an effective way to reduce the communication overhead of distributed deep learning systems; the 2Bit method in MXNet is one example. However, such methods share a problem: an excessively high compression ratio degrades both accuracy and convergence speed, especially for larger network models. To address this problem, a new gradient compression strategy called 4Bit is proposed, in which four bits are used to represent each gradient value. Compared with 2Bit, this method approximates the gradient more finely, thereby improving the accuracy of the training results and the convergence speed. Furthermore, different approximation thresholds are selected according to the gradient characteristics of each layer of the network model, which makes the compressed values more reasonable and further improves the convergence speed and final accuracy of the model. The experimental results show that, although 4Bit accelerates training slightly less than 2Bit, its use of more bits and multiple thresholds yields higher accuracy and better practicability. 4Bit is therefore a meaningful way to reduce the communication overhead of a distributed deep learning system while maintaining high accuracy.
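The scheme described in the abstract can be illustrated with a minimal sketch. The function names, the choice of 15 symmetric quantization levels, and the max-magnitude scaling rule below are illustrative assumptions, not the paper's actual implementation; they only show the general shape of a 4-bit quantizer with a per-layer scale and the residual (error-feedback) loop commonly used in gradient compression.

```python
import numpy as np

def quantize_4bit(grad):
    """Illustrative sketch (not the paper's code): quantize one layer's
    gradient tensor to 15 symmetric integer levels {-7, ..., 7}, which
    fit in 4 bits. The scale is derived from this layer's own gradient
    magnitude, playing the role of a per-layer threshold."""
    scale = float(np.abs(grad).max())
    if scale == 0.0:
        return np.zeros(grad.shape, dtype=np.int8), 0.0
    q = np.clip(np.rint(grad / scale * 7.0), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Recover an approximate gradient from its 4-bit codes."""
    return q.astype(np.float32) * (scale / 7.0)

def compress_step(grad, residual):
    """Error-feedback loop: fold the previous step's quantization
    residual into the current gradient so the lost precision is
    carried forward rather than discarded."""
    g = grad + residual
    q, scale = quantize_4bit(g)
    new_residual = g - dequantize_4bit(q, scale)
    return q, scale, new_residual
```

In a distributed setting, each worker would transmit only the int8 codes and one float scale per layer instead of full-precision gradients, roughly an 8x reduction over float32 before any bit packing.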

Key words: Deep learning, Gradient compression strategy, Distributed training

CLC Number: TP183