Computer Science ›› 2023, Vol. 50 ›› Issue (11): 220-226. doi: 10.11896/jsjkx.220900131

• Artificial Intelligence •

Adaptive Heavy-Ball Momentum Method Based on AdaGrad+ and Its Optimal Individual Convergence

WEI Hongxu1, LONG Sheng2, TAO Wei3, TAO Qing1

  1 Department of Information Engineering, Army Academy of Artillery and Air Defense of PLA, Hefei 230031, China
    2 Laboratory for Big Data and Decision, College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
    3 Institute of Evaluation and Assessment Research, PLA Academy of Military Science, Beijing 100091, China
  • Received: 2022-09-14 Revised: 2023-07-05 Online: 2023-11-15 Published: 2023-11-06
  • Corresponding author: TAO Qing (qing.tao@ia.ac.cn)
  • About author: (1140064271@qq.com)

Adaptive Heavy-Ball Momentum Method Based on AdaGrad+ and Its Optimal Individual Convergence

WEI Hongxu1, LONG Sheng2, TAO Wei3, TAO Qing1   

  1 Department of Information Engineering, Army Academy of Artillery and Air Defense of PLA, Hefei 230031, China
    2 Laboratory for Big Data and Decision, College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
    3 Institute of Evaluation and Assessment Research, PLA Academy of Military Science, Beijing 100091, China
  • Received: 2022-09-14 Revised: 2023-07-05 Online: 2023-11-15 Published: 2023-11-06
  • About author: WEI Hongxu, born in 1998, postgraduate. His main research interests include pattern recognition and machine learning. TAO Qing, born in 1965, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include pattern recognition, machine learning and applied mathematics.

Abstract: Adaptive strategies and momentum methods are common techniques for improving the performance of optimization algorithms. Most current adaptive gradient methods adopt an AdaGrad-type strategy, but this strategy performs poorly on constrained optimization, which motivated the AdaGrad+ method as a better fit for constrained problems. However, like SGD, AdaGrad+ does not attain the optimal individual convergence rate in the non-smooth convex setting, and combining it with NAG momentum also fails to deliver the expected acceleration. To address these issues, this paper combines the step-size strategy of AdaGrad+ with the accelerated convergence of the Heavy-Ball momentum method and proposes an adaptive momentum method based on AdaGrad+. By introducing a weighted momentum term, carefully selecting time-varying parameters and handling the adaptive matrix flexibly, the method is proved to attain the optimal individual convergence rate for general non-smooth convex problems. Finally, the correctness of the theoretical analysis is verified by solving a typical hinge loss optimization problem under an l∞-norm constraint, and experiments on training deep convolutional neural networks show that the method also performs well in practice.
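For orientation, the display below sketches one generic form that a projected adaptive Heavy-Ball update can take. It is only an illustrative reconstruction from the abstract, not the exact scheme analyzed in the paper: the step size α_t, the momentum weight β_t, the adaptive matrix H_t and the constraint set Q are placeholder symbols, and an AdaGrad-style diagonal H_t is assumed purely for concreteness.

$$
w_{t+1} \;=\; P_Q^{H_t}\!\Big(w_t - \alpha_t H_t^{-1} g_t + \beta_t\,(w_t - w_{t-1})\Big),
\qquad
H_t \;=\; \delta I + \mathrm{diag}\!\Big(\sum_{s=1}^{t} g_s \odot g_s\Big)^{\!1/2},
$$

where $g_t$ is a (sub)gradient of the objective at $w_t$, $\odot$ denotes the element-wise product, $\delta>0$ is a small constant, and $P_Q^{H_t}$ is the projection onto the constraint set $Q$ in the norm induced by $H_t$. The paper's contribution lies in the specific weighted momentum term and time-varying parameter choices that make the individual iterates converge at the optimal rate, which are not reproduced here.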

Keywords: Convex optimization, Adaptive strategy, AdaGrad+, Heavy-Ball momentum method, Convergence rate

Abstract: Adaptive strategies and momentum methods are commonly used to improve the performance of optimization algorithms. Most adaptive gradient methods currently adopt an AdaGrad-type strategy, which performs poorly on constrained optimization; the AdaGrad+ method was therefore proposed as a better fit for constrained problems. However, like SGD, AdaGrad+ does not attain the optimal individual convergence rate in the non-smooth convex setting, and combining it with NAG momentum also fails to achieve the expected acceleration. To address these problems, this paper proposes an adaptive momentum method based on AdaGrad+. The method uses the AdaGrad+ strategy to adjust the step size and inherits the accelerated convergence of the Heavy-Ball momentum method. By setting a weighted momentum term, selecting time-varying parameters carefully and handling the adaptive matrix flexibly, the method is proved to achieve the optimal individual convergence rate for non-smooth convex problems. Finally, experiments on a typical hinge loss optimization problem under an l∞-norm constraint verify the correctness of the theoretical analysis, and deep learning experiments confirm that the proposed method also has good performance in practical applications.
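As a rough, runnable illustration of the experimental setting described above (hinge loss under an l∞-norm constraint), the sketch below combines an AdaGrad-style coordinate-wise step size with a Heavy-Ball momentum term and a projection onto the l∞ ball. It is an assumption-laden simplification, not the algorithm analyzed in the paper: the weighted momentum term, the time-varying parameters and the exact adaptive matrix used there are not reproduced, and all names and constants (train, alpha0, beta, radius) are hypothetical.

    # Illustrative sketch (not the paper's exact algorithm): projected stochastic
    # Heavy-Ball with an AdaGrad-style diagonal step size, applied to hinge loss
    # under an l_inf-ball constraint. alpha0, beta, eps and radius are
    # hypothetical parameters chosen only for this sketch.
    import numpy as np

    def hinge_subgradient(w, x, y):
        """Subgradient of max(0, 1 - y * <w, x>) with respect to w."""
        return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(w)

    def train(X, Y, radius=1.0, alpha0=0.1, beta=0.9, eps=1e-8, epochs=5, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        w_prev = np.zeros(d)
        grad_sq_sum = np.zeros(d)  # running sum of squared subgradient coordinates
        for _ in range(epochs):
            for i in rng.permutation(n):
                g = hinge_subgradient(w, X[i], Y[i])
                grad_sq_sum += g * g
                step = alpha0 / (np.sqrt(grad_sq_sum) + eps)   # coordinate-wise adaptive step
                w_next = w - step * g + beta * (w - w_prev)    # Heavy-Ball momentum term
                w_next = np.clip(w_next, -radius, radius)      # projection onto the l_inf ball
                w_prev, w = w, w_next
        return w

    # Usage on synthetic linearly separable data:
    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(500, 20))
        w_true = rng.normal(size=20)
        Y = np.sign(X @ w_true)
        w_hat = train(X, Y)
        print("training accuracy:", np.mean(np.sign(X @ w_hat) == Y))

Note that for an l∞ (box) constraint the projection reduces to coordinate-wise clipping, which is why the sketch needs no specialized projection routine; constraint sets other than a box would require a different projection step.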

Key words: Convex optimization, Adaptive strategy, AdaGrad+, Heavy-Ball momentum method, Convergence rate

CLC Number: 

  • TP181
[1]ROBBINS H,MONRO S.A Stochastic Approximation Method[J].The Annals of Mathematical Statistics,1951,22(3):400-407.
[2]SUN T,LI D S,QUAN Z,et al.Heavy-ball Algorithms Always Escape Saddle Points[C]//Proceedings of the 28th International Joint Conference on Artificial Intelligence(IJCAI 2019).San Francisco:Morgan Kaufmann,2019:3520-3526.
[3]NESTEROV Y.A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k2)[J].Soviet Mathematics Doklady,1983,27(2):372-376.
[4]DUCHI J,HAZAN E,SINGER Y.Adaptive Subgradient Methods for Online Learning and Stochastic Optimization[J].Journal of Machine Learning Research,2011,12:2121-2159.
[5]TIELEMAN T,HINTON G.RMSProp:Divide the Gradient by a Running Average of its Recent Magnitude[D].Toronto:University of Toronto,2012.
[6]ZEILER M D.ADADELTA:An Adaptive Learning Rate Method[J].arXiv:1212.5701,2012.
[7]LEVY K Y.Online to offline conversions,universality and adaptive minibatch sizes[C]//Proceedings of the 31st Annual Conference on Neural Information Processing Systems(NeurIPS).Cambridge,MA:MIT Press,2017:1613-1622.
[8]ENE A,NGUYEN H L,VLADU A.Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:7314-7321.
[9]POLYAK B T.Some Methods of Speeding up the Convergence of Iteration Methods[J].USSR Computational Mathematics and Mathematical Physics,1964,4(5):1-17.
[10]GHADIMI E,FEYZMAHDAVIAN H R,JOHANSSON M.Global Convergence of the Heavy-Ball Method for Convex Optimization[C]//Proceedings of the European Control Conference.Washington,USA:IEEE,2015:310-315.
[11]CHENG Y J,TAO W,LIU Y X,et al.Optimal Individual Convergence Rate of the Heavy-Ball-based Momentum Methods[J].Chinese Journal of Computer Research and Development,2019,56(8):1686-1694.
[12]TAO W,PAN Z S,CHU D J,et al.The Individual Convergence of Projected Subgradient Methods Using the Nesterov's Step-Size Strategy[J].Chinese Journal of Computers,2018,41(1):164-176.
[13]TAO W,PAN Z S,WU G W,et al.The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth Optimization[J].IEEE Transactions on Neural Networks and Learning Systems,2020,31(7):2557-2568.
[14]KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[J].arXiv:1412.6980,2014.
[15]REDDI S J,KALE S,KUMAR S.On the Convergence of Adam and Beyond[J].arXiv:1904.09237,2019.
[16]TAO W,LONG S,WU G W,et al.The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-Ball Methods[J].arXiv:2102.07314,2021.
[17]ZHANG Z D,LONG S,BAO L,et al.AdaBelief Based Heavy-Ball Momentum Method[J].Pattern Recognition and Artificial Intelligence,2022,35(2):106-115.
[18]ZHUANG J T,TANG T,DING Y F,et al.AdaBelief Optimizer:Adapting Stepsizes by the Belief in Observed Gradients [J].arXiv:2010.07468,2020.
[19]ZINKEVICH M.Online Convex Programming and Generalized Infinitesimal Gradient Ascent[C]//Proceedings of the 20th International Conference on Machine Learning.New York:ACM,2003.
[20]AGARWAL A,BARTLETT P L,RAVIKUMAR P,et al.Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization[J].IEEE Transactions on Information Theory,2012,58(5):3235-3249.
[21]SHAMIR O,ZHANG T.Stochastic Gradient Descent for Non-smooth Optimization:Convergence Results and Optimal Averaging Schemes[C]//Proceedings of the 29th International Conference on Machine Learning.New York,USA:ACM,2013:71-79.
[22]ZOU F Y,SHEN L,JIE Z Q,et al.A Sufficient Condition for Convergences of Adam and RMSprop[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington,USA,2019:11119-11127.
[23]SHALEV S S,SINGER Y,SREBRO N,et al.Pegasos:Primal estimated sub-gradient solver for svm[J].Mathematical Programming,2011,127(1):3-30.
[24]RAKHLIN A,SHAMIR O,SRIDHARAN K.Making gradient descent optimal for strongly convex stochastic optimization[C]//Proceedings of the 29th International Conference on Machine Learning.New York,USA:ACM,2012:449-456.
[25]LIU J,JI S,YE J.SLEP:Sparse learning with efficient projections[D].Arizona:Arizona State University,2009.