Computer Science ›› 2022, Vol. 49 ›› Issue (6): 335-341.doi: 10.11896/jsjkx.210300081

• Artificial Intelligence •

Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning

FAN Jing-yu1, LIU Quan1,2,3,4   

    1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
  • Received:2021-03-08 Revised:2022-01-21 Online:2022-06-15 Published:2022-06-08
  • About author:FAN Jing-yu,born in 1995,postgraduate.His main research interests include deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61502323,61502329),Jiangsu Province Natural Science Research University Major Projects(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Industrial Application of Basic Research Program(SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Reinforcement learning is an important branch of machine learning. With the development of deep learning, deep reinforcement learning has gradually become the focus of reinforcement learning research. Model-free off-policy deep reinforcement learning algorithms for continuous control have attracted wide attention because of their strong practicality. Like Q-learning, actor-critic algorithms suffer from overestimation. The clipped double Q-learning method mitigates overestimation in actor-critic algorithms to a certain extent, but it also introduces underestimation into the learning process. To further address both overestimation and underestimation in actor-critic algorithms, a new learning method, randomly weighted triple Q-learning, is proposed. Combining this method with the soft actor-critic (SAC) algorithm yields a new algorithm, soft actor-critic based on randomly weighted triple Q-learning (SAC-RWTQ). The algorithm not only keeps the Q estimate close to the true Q value but also increases the randomness of the Q estimate through random weighting, thereby alleviating both overestimation and underestimation of action values during learning. Experimental results show that, compared with SAC and other currently popular deep reinforcement learning algorithms such as DDPG, PPO and TD3, SAC-RWTQ achieves better performance on several MuJoCo tasks on the OpenAI Gym simulation platform.
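To make the idea concrete, the following NumPy sketch contrasts the clipped double Q-learning target used in TD3 and SAC with one plausible form of a randomly weighted triple Q-learning target. The blending rule shown (a random convex combination of the minimum and the mean of three critic estimates) is an illustrative assumption rather than the paper's exact formula, and SAC's entropy bonus is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_double_q_target(q1, q2, reward, gamma=0.99):
    # Clipped double Q-learning (TD3/SAC style): take the elementwise
    # minimum of two critic estimates, which suppresses overestimation
    # but can underestimate the true action value.
    return reward + gamma * np.minimum(q1, q2)

def randomly_weighted_triple_q_target(q1, q2, q3, reward, gamma=0.99):
    # Hypothetical randomly weighted triple Q target (illustrative
    # assumption, not the paper's exact formula): blend the minimum
    # (pessimistic) and the mean (optimistic) of three critic estimates
    # with a random weight redrawn at each update, keeping the target
    # near the true value while adding randomness to the estimate.
    q_min = np.minimum(np.minimum(q1, q2), q3)
    q_mean = (q1 + q2 + q3) / 3.0
    beta = rng.uniform(0.0, 1.0)  # random convex-combination weight
    return reward + gamma * (beta * q_min + (1.0 - beta) * q_mean)

# Toy next-state action values from the critics for a batch of 4 transitions.
q1 = np.array([1.0, 2.0, 0.5, 1.5])
q2 = np.array([1.2, 1.8, 0.7, 1.4])
q3 = np.array([0.9, 2.1, 0.6, 1.6])
r = np.array([0.1, 0.0, 0.2, 0.1])

print("clipped double Q target:", clipped_double_q_target(q1, q2, r))
print("randomly weighted triple Q target:",
      randomly_weighted_triple_q_target(q1, q2, q3, r))
```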

Key words: Continuous action space, Deep learning, Maximum entropy, Off-policy reinforcement learning, Q-learning, Soft actor critic algorithm

CLC Number: TP181
[1] SUTTON R S,BARTO A G.Reinforcement Learning:An Introduction[M].Massachusetts:MIT Press,2018.
[2] HUA J,ZENG L,LI G,et al.Learning for a Robot:Deep Reinforcement Learning,Imitation Learning,Transfer Learning[J].Sensors,2021,21(4):1278.
[3] SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the Game of Go without Human Knowledge[J].Nature,2017,550(7676):354-359.
[4] SAMUEL A L.Some Studies in Machine Learning Using the Game of Checkers[J].IBM Journal of Research and Development,2000,44(1/2):206-226.
[5] CHEN J P,ZOU F,LIU Q,et al.A Reinforcement Learning Algorithm Based on Generative Adversarial Networks[J].Theoretical Computer Science,2019,46(10):265-272.
[6] WATKINS C,DAYAN P.Technical Note:Q-Learning[J].Machine Learning,1992,8:279-292.
[7] SUTTON R S.Learning to Predict by the Method of Temporal Differences[J].Machine Learning,1988,3(1):9-44.
[8] GOODFELLOW I,BENGIO Y,COURVILLE A.Deep Learning[M].Massachusetts:MIT Press,2016.
[9] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Playing Atari with Deep Reinforcement Learning[J].arXiv:1312.5602,2013.
[10] HASSELT H V,GUEZ A,SILVER D.Deep Reinforcement Learning with Double Q-Learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.AAAI Press,2016:2094-2100.
[11] LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous Control with Deep Reinforcement Learning[C]//Proceedings of the 4th International Conference on Learning Representations.ICLR,2016.
[12] FUJIMOTO S,HOOF H V,MEGER D.Addressing Function Approximation Error in Actor-Critic Methods[C]//International Conference on Machine Learning.PMLR,2018:1587-1596.
[13] HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft Actor-Critic:Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]//Proceedings of the 35th International Conference on Machine Learning.PMLR,2018:1856-1865.
[14] SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust Region Policy Optimization[C]//International Conference on Machine Learning.PMLR,2015:1889-1897.
[15] SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal Policy Optimization Algorithms[J].arXiv:1707.06347,2017.
[16] MNIH V,BADIA A P,MIRZA M,et al.Asynchronous Methods for Deep Reinforcement Learning[C]//International Conference on Machine Learning.PMLR,2016:1928-1937.
[17] HASSELT H V.Double Q-learning[J].Advances in Neural Information Processing Systems,2010,23:2613-2621.
[18] RUDER S.An Overview of Gradient Descent Optimization Algorithms[J].arXiv:1609.04747,2016.
[19] KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[J].arXiv:1412.6980,2014.
[20] THRUN S,SCHWARTZ A.Issues in Using Function Approximation for Reinforcement Learning[C]//Proceedings of the Fourth Connectionist Models Summer School.Erlbaum,1993:255-263.
[21] BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540,2016.
[22] TODOROV E,EREZ T,TASSA Y.MuJoCo:A Physics Engine for Model-based Control[C]//Intelligent Robots and Systems.IEEE,2012:5026-5033.