Computer Science, 2022, 49(3): 239-245. doi: 10.11896/jsjkx.201200173
• Artificial Intelligence •
ZHOU Qin, LUO Fei, DING Wei-chao, GU Chun-hua, ZHENG Shuai