Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230600235-5. doi: 10.11896/jsjkx.230600235

• Artificial Intelligence •

Weighted Double Q-Learning Algorithm Based on Softmax

ZHONG Yuang, YUAN Weiwei, GUAN Donghai   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Published: 2024-06-06
  • About author: ZHONG Yuang, born in 2000, postgraduate. His main research interests include reinforcement learning.
    YUAN Weiwei, born in 1981, Ph.D, Professor, Ph.D supervisor. Her main research interests include data mining and intelligent computing.

Abstract: As a branch of machine learning, reinforcement learning describes and solves the problem of an agent learning a policy that maximizes return while interacting with its environment. Q-Learning, a classical model-free reinforcement learning method, suffers from maximization bias caused by overestimation and performs poorly when the environment is noisy. Double Q-Learning (DQL) eliminates the overestimation problem, but introduces underestimation in its place. To address both the overestimation and the underestimation in these algorithms, a weighted Q-Learning algorithm based on softmax is proposed and, combined with DQL, extended into a new weighted double Q-Learning algorithm based on softmax (WDQL-Softmax). The algorithm constructs weighted double estimators: a softmax operation is applied to the expected values of the samples to obtain weights, and these weights are then used to estimate the action value, effectively balancing overestimation and underestimation so that the estimate lies closer to the theoretical value. Experimental results show that, in discrete action spaces, WDQL-Softmax converges faster than the Q-Learning, double Q-Learning and weighted double Q-Learning algorithms, and yields a smaller error between the estimated value and the theoretical value.
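To make the update rule concrete, the following minimal tabular sketch in Python illustrates one plausible reading of the WDQL-Softmax update described above. It is written from the abstract alone, not from the paper: the choice of the mixing weight beta as the softmax probability of the greedy action, the temperature tau, and all function and variable names (softmax, wdql_softmax_update, QA, QB) are illustrative assumptions rather than the authors' exact formulation.

import numpy as np

def softmax(x, tau=1.0):
    # Numerically stable softmax with temperature tau.
    z = (x - np.max(x)) / tau
    e = np.exp(z)
    return e / e.sum()

def wdql_softmax_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.95, tau=1.0):
    # One hypothetical tabular WDQL-Softmax step. QA and QB are the double
    # estimators, arrays of shape (n_states, n_actions). As in double
    # Q-learning, a fair coin decides which table is updated; the other
    # table helps evaluate the greedy action.
    if np.random.rand() < 0.5:
        Q_upd, Q_other = QA, QB
    else:
        Q_upd, Q_other = QB, QA
    a_star = np.argmax(Q_upd[s_next])            # greedy action under the updated table
    beta = softmax(Q_upd[s_next], tau)[a_star]   # softmax weight of the greedy action
    # Weighted target: beta leans on Q_upd (prone to overestimation),
    # 1 - beta leans on Q_other (prone to underestimation).
    target = r + gamma * (beta * Q_upd[s_next, a_star]
                          + (1.0 - beta) * Q_other[s_next, a_star])
    Q_upd[s, a] += alpha * (target - Q_upd[s, a])

In this sketch, a large tau pushes the weights toward uniform, so the target leans mostly on the other estimator as in double Q-learning; a small tau drives beta toward 1, recovering the self-evaluating target of ordinary Q-learning. This interpolation is consistent with the balancing behavior the abstract describes.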

Key words: Reinforcement learning, Q-Learning, Double Q-Learning, Softmax

CLC Number: TP181
[1] WIERING M, VAN OTTERLO M. Reinforcement Learning: State of the Art[M]. New York: Springer, 2012.
[2] LI Y. Deep reinforcement learning: An overview[J]. arXiv:1701.07274, 2017.
[3] KAISER L, BABAEIZADEH M, MILOS P, et al. Model Based Reinforcement Learning for Atari[C]//International Conference on Learning Representations. 2019.
[4] JOHANNINK T, BAHL S, NAIR A, et al. Residual reinforcement learning for robot control[C]//2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019: 6023-6029.
[5] KIRAN B R, SOBH I, TALPAERT V, et al. Deep reinforcement learning for autonomous driving: A survey[J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 23(6): 4909-4926.
[6] WU X, CHEN H, WANG J, et al. Adaptive stock trading strategies with deep reinforcement learning methods[J]. Information Sciences, 2020, 538: 142-158.
[7] WATKINS C J C H, DAYAN P. Q-learning[J]. Machine Learning, 1992, 8: 279-292.
[8] LEE D, DEFOURNY B, POWELL W B. Bias-corrected Q-learning to control max-operator bias in Q-learning[C]//2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 2013: 93-99.
[9] AZAR M G, MUNOS R, GHAVAMZADEH M, et al. Speedy Q-learning[C]//Advances in Neural Information Processing Systems. 2011: 2411-2419.
[10] HASSELT H. Double Q-learning[C]//Proceedings of the 23rd International Conference on Neural Information Processing Systems. 2010: 2613-2621.
[11] D'ERAMO C, RESTELLI M, NUARA A. Estimating maximum expected value through Gaussian approximation[C]//International Conference on Machine Learning. PMLR, 2016: 1032-1040.
[12] ZHANG Z, PAN Z, KOCHENDERFER M J. Weighted double Q-learning[C]//IJCAI. 2017: 3455-3461.
[13] REN Z, ZHU G, HU H, et al. On the Estimation Bias in Double Q-Learning[J]. Advances in Neural Information Processing Systems, 2021, 34: 10246-10259.
[14] WANG Y, LIU Y, CHEN W, et al. Target transfer Q-learning and its convergence analysis[J]. Neurocomputing, 2020, 392: 11-22.
[15] SUTTON R S, BARTO A G. Reinforcement learning: An introduction[M]. MIT Press, 2018.