Computer Science, 2025, Vol. 52, Issue 11A: 250100106-6. doi: 10.11896/jsjkx.250100106

• Computer Networks •


Adaptive Gradient Sparsification Approach to Training Deep Neural Networks

HUANG Xinli, GAO Guoju   

  1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online: 2025-11-15  Published: 2025-11-10
  • Corresponding author: HUANG Xinli (109838280@qq.com)


Abstract: Top-k sparsification with error compensation is one of the state-of-the-art techniques for training distributed deep neural networks (DNNs). It reduces communication by transmitting only part of the gradients in each iteration, where the amount of transmitted gradient depends on the value of k. Although a smaller k speeds up training, it may degrade test accuracy even with error compensation, a trade-off known as the speed-accuracy dilemma. Based on the observation that the growth rates of training accuracy and test accuracy are dynamically correlated over time, this paper presents AdaTopK, an adaptive Top-k compressor with convergence guarantees. AdaTopK dynamically adjusts the value of k to accelerate training while maintaining or improving test accuracy. Extensive experiments in both static and dynamic network scenarios show that AdaTopK reduces training time by 29% compared with the no-compression baseline and by 15% compared with DC2.
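To make the mechanism concrete, the sketch below illustrates Top-k gradient sparsification with error feedback (error compensation) together with a simplified adaptive rule for k. It is a minimal illustration under assumptions, not the paper's AdaTopK algorithm: the names topk_sparsify, ErrorFeedbackTopK and adjust_k, the 0.5 threshold, and the doubling/halving schedule are hypothetical and introduced here only for exposition.

```python
# Minimal sketch: Top-k gradient sparsification with error feedback
# (error compensation) and a simplified, hypothetical adaptive-k rule.
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest.
    Returns (sparse gradient to transmit, residual that was dropped)."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy(), np.zeros_like(grad)
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # k largest by magnitude
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.reshape(grad.shape)
    return sparse, grad - sparse

class ErrorFeedbackTopK:
    """Error-compensated Top-k compressor: the gradient mass dropped in one
    iteration is added back to the gradient of the next iteration."""
    def __init__(self, k):
        self.k = k
        self.residual = None

    def compress(self, grad):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        corrected = grad + self.residual            # error compensation
        sparse, self.residual = topk_sparsify(corrected, self.k)
        return sparse                               # only this is transmitted

def adjust_k(k, train_acc_gain, test_acc_gain, k_min, k_max):
    """Hypothetical adaptive rule: if test accuracy stalls while training
    accuracy is still rising, transmit more gradient (larger k); otherwise
    compress more aggressively (smaller k)."""
    if test_acc_gain < 0.5 * train_acc_gain:
        return min(2 * k, k_max)
    return max(k // 2, k_min)

# Example: compress a fake gradient, keeping 10% of its entries.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.normal(size=(1000,))
    compressor = ErrorFeedbackTopK(k=100)
    sent = compressor.compress(grad)
    print("non-zero entries sent:", np.count_nonzero(sent))   # -> 100
```

In a data-parallel setting, each worker would call compress() on its local gradient before aggregation, and an adaptive rule of this kind would be evaluated periodically from the observed training and test accuracy curves rather than at every iteration.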

Key words: Distributed training, Network compression, Sparsification, Deep neural networks, Error compensation

CLC Number: TP319

References
[1]ZHU Y W.Research on Key Technologies of Text Classification Based on Deep Learning and Attention Mechanism[D].Nanjing University of Information Science and Technology,2024.
[2]CHENG Z T,HUANG H R,XUE H,et al.Event Causality Identification Model Based on Prompt Learning and Hypergraph[J].Computer Science,2025,52(9):303-312.
[3]LYU Y F,ZHANG X L,GAO W N,et al.The Application of Deep Learning in Customs Image Recognition Technology[J].China Port Science and Technology,2024,6(Z2):4-12.
[4]JING Y.Research on Recognition Methods for Partially Occluded Face Images Based on Deep Learning[J].Internet Weekly,2024(21):56-58.
[5]VOGELS T,KARIMIREDDY S P,JAGGI M.PowerSGD:Practical low-rank gradient compression for distributed optimization[C]//NeurIPS.2019:14236-14245.
[6]LI S Q.Improvement of Gradient Sparsification Methods in Distributed Deep Learning Model Training[D].Beijing:Beijing University of Posts and Telecommunications,2021.
[7]SAPIO A,CANINI M,HO C Y,et al.Scaling distributed machine learning with in-network aggregation[J/OL].https://www.usenix.org/system/files/nsdi21-sapio.pdf.
[8]SEIDE F,FU H,DROPPO J,et al.1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns[C]//INTERSPEECH.2014:1058-1062.
[9]OUYANG S.Research on Communication Optimization Techniques for Distributed Deep Learning Based on Gradient Compression[D].Changsha:National University of Defense Technology,2021.
[10]ESSER S K,MCKINSTRY J L,BABLANI D,et al.Learned step size quantization[C]//ICLR.2020.
[11]LI R,WANG Y,LIANG F,et al.Fully quantized network for object detection[C]//CVPR.2019:2810-2819.
[12]LIU S W.Research and Implementation of High-performance Binary Convolutional Neural Networks[D].Hangzhou:Zhejiang University,2021.
[13]AJI A F,HEAFIELD K.Sparse communication for distributed gradient descent[C]//EMNLP.2017:440-445.
[14]HORVÁTH S,RICHTÁRIK P.A better alternative to errorfeedback for communication-efficient distributed learning[C]//ICLR.2021.
[15]LIN Y,HAN S,MAO H,et al.Deep gradient compression:Reducing the communication bandwidth for distributed training[C]//ICLR.2018.
[16]NIU L L.Implementation and Application of Top-K Algorithms Based on Deep Learning Processors[D].Beijing:School of Artificial Intelligence,University of Chinese Academy of Sciences,2020.
[17]ABDELMONIEM A M,CANINI M.DC2:Delay-aware compression control for distributed machine learning[C]//INFOCOM.2021.
[18]ALISTARH D,GRUBIC D,LI J,et al.QSGD:Communication-efficient SGD via gradient quantization and encoding[C]//NIPS.2017:1709-1720.
[19]DRYDEN N,MOON T,JACOBS S A,et al.Communication quantization for data-parallel training of deep neural networks[C]//MLHPC@SC.2016:1-8.
[20]SHI S,TANG Z,WANG Q,et al.Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees[C]//ECAI.2020:1467-1474.
[21]ALISTARH D,HOEFLER T,JOHANSSON M,et al.The convergence of sparsified gradient methods[C]//NeurIPS.2018:5977-5987.
[22]CHEN C Y,NI J,LU S,et al.ScaleCom:Scalable sparsified gradient compression for communication-efficient distributed training[C]//NeurIPS.2020.
[23]BRADLEY J K,KYROLA A,BICKSON D,et al.Parallel coordinate descent for l1-regularized loss minimization[C]//ICML.2011:321-328.