Computer Science ›› 2023, Vol. 50 ›› Issue (11): 220-226. doi: 10.11896/jsjkx.220900131
韦洪旭1, 陇盛2, 陶蔚3, 陶卿1
WEI Hongxu1, LONG Sheng2, TAO Wei3, TAO Qing1
Abstract: Adaptive strategies and momentum are commonly used to improve the performance of optimization algorithms. Most existing adaptive gradient methods adopt an AdaGrad-type strategy, but this strategy performs poorly on constrained optimization problems. To address this, researchers proposed the AdaGrad+ method, which is better suited to constrained problems; however, like SGD, it does not attain the optimal individual convergence rate in the non-smooth convex setting, and combining it with NAG momentum does not deliver the expected acceleration either. To resolve these issues, this paper combines the step-size adaptation strategy of AdaGrad+ with the accelerated convergence of Heavy-Ball momentum and proposes an adaptive momentum method based on AdaGrad+. By introducing a weighted momentum term, carefully choosing time-varying parameters, and handling the adaptive matrix flexibly, the method is proved to attain the optimal individual convergence rate for non-smooth general convex problems. Finally, under an l∞-norm constraint, the correctness of the theoretical analysis is verified by solving a typical hinge-loss optimization problem, and training experiments on deep convolutional neural networks show that the method also performs well in practice.
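To make the combination concrete, the following Python sketch illustrates the general idea under stated assumptions: a projected subgradient update for the hinge loss under an l∞-norm constraint, using an AdaGrad-style per-coordinate step size together with a Heavy-Ball momentum term. It is an illustrative simplification, not the paper's algorithm: the weighted momentum term, the time-varying parameters, and the exact AdaGrad+ step-size rule of the proposed method are replaced here by a constant momentum coefficient and a standard AdaGrad accumulator, and all names and hyperparameters (eta, beta, radius) are hypothetical.

import numpy as np

# Minimal sketch (illustrative, not the paper's exact method):
# projected subgradient descent with an AdaGrad-style diagonal
# step size and a Heavy-Ball momentum term, applied to the
# average hinge loss under an l_inf-ball constraint.

rng = np.random.default_rng(0)
n, d, radius = 500, 20, 1.0              # samples, features, l_inf radius
X = rng.normal(size=(n, d))
w_true = rng.uniform(-1, 1, size=d)
y = np.sign(X @ w_true)                  # synthetic labels

def hinge_subgradient(w, X, y):
    """Subgradient of the average hinge loss max(0, 1 - y * <w, x>)."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    return -(y[active, None] * X[active]).sum(axis=0) / len(y)

def project_linf(w, radius):
    """Euclidean projection onto the l_inf ball: coordinate-wise clipping."""
    return np.clip(w, -radius, radius)

w = np.zeros(d)
w_prev = np.zeros(d)
grad_sq_sum = np.zeros(d)                # AdaGrad-style accumulator
eps, eta, beta = 1e-8, 0.5, 0.9          # hypothetical hyperparameters

for _ in range(1000):
    g = hinge_subgradient(w, X, y)
    grad_sq_sum += g ** 2
    step = eta / (np.sqrt(grad_sq_sum) + eps)    # per-coordinate step size
    momentum = beta * (w - w_prev)               # Heavy-Ball term
    w_next = project_linf(w + momentum - step * g, radius)
    w_prev, w = w, w_next

loss = np.maximum(0.0, 1.0 - y * (X @ w)).mean()
print(f"final hinge loss: {loss:.4f}")

Because the l∞-ball projection reduces to coordinate-wise clipping, the constraint is cheap to enforce at every iteration, which is one reason this constraint set is a convenient test case for the theoretical analysis.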