Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 250100106-6. doi: 10.11896/jsjkx.250100106
黄新利, 高国举
HUANG Xinli, GAO Guoju
Abstract: Top-k sparsification with error compensation is currently one of the state-of-the-art techniques for distributed deep neural network (DNN) training: in each training iteration it transmits only a subset of the gradients to reduce communication, and the volume of transmitted gradients is determined by the choice of k. Although a smaller k speeds up training, it may degrade test accuracy even when error compensation is applied. This paper proposes AdaTopK, an adaptive Top-k compressor that dynamically adjusts k to balance training speed against test accuracy. Extensive experiments under dynamic network conditions show that AdaTopK reduces training time by 29% compared with training without compression, and by 15% compared with the existing DC2 scheme.
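To make the compression step concrete, the following is a minimal PyTorch-style sketch of Top-k gradient sparsification with error compensation, the building block that AdaTopK adapts. The class name TopKCompressor, the k_ratio parameter, and the residual-buffer bookkeeping are illustrative assumptions rather than the paper's implementation, and the rule AdaTopK uses to adjust k at run time is not reproduced here.

# A minimal sketch of Top-k sparsification with error feedback (error compensation).
# Hypothetical helper names; this is not the paper's AdaTopK code, which additionally
# adjusts k (here k_ratio) dynamically during training.
import torch

class TopKCompressor:
    def __init__(self, k_ratio: float = 0.01):
        self.k_ratio = k_ratio      # fraction of gradient entries transmitted each step
        self.residual = {}          # per-parameter error-compensation buffer

    def compress(self, name: str, grad: torch.Tensor):
        # fold in the error left over from the previous iteration
        flat = grad.detach().flatten()
        if name in self.residual:
            flat = flat + self.residual[name]
        k = max(1, int(self.k_ratio * flat.numel()))
        _, idx = torch.topk(flat.abs(), k)
        values = flat[idx]
        # error compensation: remember everything that was *not* transmitted
        residual = flat.clone()
        residual[idx] = 0.0
        self.residual[name] = residual
        return values, idx

    @staticmethod
    def decompress(values: torch.Tensor, idx: torch.Tensor, shape) -> torch.Tensor:
        # scatter the received values back into a dense zero tensor
        flat = torch.zeros(torch.Size(shape).numel(), dtype=values.dtype)
        flat[idx] = values
        return flat.view(shape)

In a data-parallel setting each worker would call compress() on its local gradients, exchange only (values, idx), and apply decompress() before aggregation; a smaller k_ratio sends less data per iteration but leaves more gradient mass in the residual buffer, which is exactly the speed/accuracy trade-off the abstract describes.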
CLC Number:
[1]ZHU Y W.Research on Key Technologies of Text Classification Based on Deep Learning and Attention Mechanism[D].Nanjing University of Information Science and Technology,2024.
[2]CHENG Z T,HUANG H R,XUE H,et al.Event Causality Identification Model Based on Prompt Learning and Hypergraph[J].Computer Science,2025,52(9):303-312.
[3]LYU Y F,ZHANG X L,GAO W N,et al.The Application of Deep Learning in Customs Image Recognition Technology[J].China Port Science and Technology,2024,6(Z2):4-12.
[4]JING Y.Research on Recognition Methods for Partially Occluded Face Images Based on Deep Learning[J].Internet Weekly,2024(21):56-58.
[5]VOGELS T,KARIMIREDDY S P,JAGGI M.PowerSGD:Practical low-rank gradient compression for distributed optimization[C]//NeurIPS.2019:14236-14245.
[6]LI S Q.Improvement of Gradient Sparsification Methods in Distributed Deep Learning Model Training[D].Beijing:Beijing University of Posts and Telecommunications,2021.
[7]SAPIO A,CANINI M,HO C Y,et al.Scaling distributed machine learning with in-network aggregation[J/OL].https://www.usenix.org/system/files/nsdi21-sapio.pdf.
[8]SEIDE F,FU H,DROPPO J,et al.1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs[C]//INTERSPEECH.2014:1058-1062.
[9]OUYANG S.Research on Communication Optimization Techniques for Distributed Deep Learning Based on Gradient Compression[D].Changsha:National University of Defense Technology,2021.
[10]ESSER S K,MCKINSTRY J L,BABLANI D,et al.Learned step size quantization[C]//ICLR.2020.
[11]LI R,WANG Y,LIANG F,et al.Fully quantized network for object detection[C]//CVPR.2019:2810-2819.
[12]LIU S W.Research and Implementation of High-Performance Binary Convolutional Neural Networks[D].Hangzhou:Zhejiang University,2021.
[13]AJI A F,HEAFIELD K.Sparse communication for distributed gradient descent[C]//EMNLP.2017:440-445.
[14]HORVÁTH S,RICHTÁRIK P.A better alternative to error feedback for communication-efficient distributed learning[C]//ICLR.2021.
[15]LIN Y,HAN S,MAO H,et al.Deep gradient compression:Reducing the communication bandwidth for distributed training[C]//ICLR.2018.
[16]NIU L L.Implementation and Application of the Top-K Algorithm on Deep Learning Processors[D].Beijing:University of Chinese Academy of Sciences(School of Artificial Intelligence),2020.
[17]ABDELMONIEM A M,CANINI M.DC2:Delay-aware compression control for distributed machine learning[C]//INFOCOM.2021.
[18]ALISTARH D,GRUBIC D,LI J,et al.QSGD:Communication-efficient SGD via gradient quantization and encoding[C]//NIPS.2017:1709-1720.
[19]DRYDEN N,MOON T,JACOBS S A,et al.Communication quantization for data-parallel training of deep neural networks[C]//MLHPC@SC.2016:1-8.
[20]SHI S,TANG Z,WANG Q,et al.Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees[C]//ECAI.2020:1467-1474.
[21]ALISTARH D,HOEFLER T,JOHANSSON M,et al.The convergence of sparsified gradient methods[C]//NeurIPS.2018:5977-5987.
[22]CHEN C Y,NI J,LU S,et al.ScaleCom:Scalable sparsified gradient compression for communication-efficient distributed training[C]//NeurIPS.2020.
[23]BRADLEY J K,KYROLA A,BICKSON D,et al.Parallel coordinate descent for l1-regularized loss minimization[C]//ICML.2011:321-328.