Computer Science ›› 2022, Vol. 49 ›› Issue (6): 335-341. doi: 10.11896/jsjkx.210300081

• Artificial Intelligence •

Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning

FAN Jing-yu1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Jiangsu Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3 Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
  • Received:2021-03-08 Revised:2022-01-21 Online:2022-06-15 Published:2022-06-08
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author:FAN Jing-yu,born in 1995,postgraduate (20185227066@stu.suda.edu.cn).His main research interests include deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61502323,61502329),Major Projects of Natural Science Research in Jiangsu Higher Education Institutions(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Applied Basic Research Program,Industrial Part(SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Reinforcement learning is an important branch of machine learning.With the development of deep learning,deep reinforcement learning has gradually become the focus of reinforcement learning research.Model-free off-policy deep reinforcement learning algorithms for continuous control have attracted wide attention because of their broad applicability and strong practicality.Like discrete-action Q-learning,actor-critic style algorithms suffer from overestimation of action values.The clipped double Q-learning method alleviates this overestimation in actor-critic algorithms to a certain extent,but it also introduces underestimation into the learning process.To further address both overestimation and underestimation in actor-critic algorithms,a new learning method,randomly weighted triple Q-learning,is proposed.Furthermore,combining this new method with the soft actor-critic algorithm,a new soft actor-critic algorithm based on randomly weighted triple Q-learning (SAC-RWTQ) is proposed.This algorithm not only keeps the Q estimates close to the true Q values,but also increases the randomness of the Q estimates through the random weighting method,thereby effectively alleviating both overestimation and underestimation of action values during learning.Experimental results show that,compared with the SAC algorithm and other popular deep reinforcement learning algorithms such as DDPG,PPO and TD3,the SAC-RWTQ algorithm achieves better performance on several MuJoCo tasks on the Gym simulation platform.
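To make the mechanism concrete, below is a minimal, hypothetical sketch of how such a randomly weighted triple-Q target could be computed. It is not the paper's reference implementation: the function name rwtq_target, the uniform sampling of the weight beta, and the choice to interpolate between the minimum and the mean of the three critics are illustrative assumptions consistent with the abstract's description.

```python
# Hypothetical sketch of a randomly weighted triple-Q soft Bellman target,
# in the spirit of SAC-RWTQ.  All names and the exact weighting scheme are
# illustrative assumptions, not the authors' reference implementation.
import torch


def rwtq_target(q1, q2, q3, next_log_prob, reward, done,
                gamma=0.99, alpha=0.2):
    """Soft Bellman backup whose next-state value interpolates, with a
    random weight, between the minimum of three target-critic estimates
    (pessimistic, as in clipped double Q-learning) and their mean
    (which counteracts the underestimation of taking the minimum)."""
    q_stack = torch.stack([q1, q2, q3])            # shape: (3, batch)
    q_min = q_stack.min(dim=0).values              # lower bound on Q
    q_mean = q_stack.mean(dim=0)                   # less pessimistic estimate
    beta = torch.rand_like(q_min)                  # random weight in [0, 1)
    q_next = beta * q_min + (1.0 - beta) * q_mean  # randomly weighted Q
    v_next = q_next - alpha * next_log_prob        # soft (max-entropy) value
    return reward + gamma * (1.0 - done) * v_next  # TD target for the critics
```

With beta fixed at 1 this reduces to a clipped-style minimum over three critics, and with beta fixed at 0 it reduces to their mean; drawing beta at random keeps each target between those two extremes, which is one way to read the abstract's claim of constraining the Q estimate near the true value while injecting randomness.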

Key words: Continuous action space, Deep learning, Maximum entropy, Off-policy reinforcement learning, Q-learning, Soft actor-critic algorithm

CLC Number: TP181
[1] SUTTON R S,BARTO A G.Reinforcement Learning:An Introduction[M].Massachusetts:MIT Press,2018.
[2] HUA J,ZENG L,LI G,et al.Learning for a Robot:Deep Reinforcement Learning,Imitation Learning,Transfer Learning[J].Sensors,2021,21(4):1278.
[3] SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the Game of Go without Human Knowledge[J].Nature,2017,550(7676):354-359.
[4] SAMUEL A L.Some Studies in Machine Learning Using the Game of Checkers[J].IBM Journal of Research and Development,2000,44(1/2):206-226.
[5] CHEN J P,ZOU F,LIU Q,et al.A Reinforcement Learning Algorithm Based on Generative Adversarial Networks[J].Theoretical Computer Science,2019,46(10):265-272.
[6] WATKINS C,DAYAN P.Technical Note:Q-learning[J].Machine Learning,1992,8:279-292.
[7] SUTTON R S.Learning to Predict by the Method of Temporal Differences[J].Machine Learning,1988,3(1):9-44.
[8] GOODFELLOW I,BENGIO Y,COURVILLE A.Deep Learning[M].Massachusetts:MIT Press,2016.
[9] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Playing Atari with Deep Reinforcement Learning[J].arXiv:1312.5602,2013.
[10] HASSELT H V,GUEZ A,SILVER D.Deep Reinforcement Learning with Double Q-Learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.AAAI Press,2016:2094-2100.
[11] LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous Control with Deep Reinforcement Learning[C]//Proceedings of the 4th International Conference on Learning Representations.ICLR,2016.
[12] FUJIMOTO S,HOOF H V,MEGER D.Addressing Function Approximation Error in Actor-Critic Methods[C]//International Conference on Machine Learning.PMLR,2018:1587-1596.
[13] HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft Actor-Critic:Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]//Proceedings of the 35th International Conference on Machine Learning.PMLR,2018:1856-1865.
[14] SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust Region Policy Optimization[C]//International Conference on Machine Learning.PMLR,2015:1889-1897.
[15] SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal Policy Optimization Algorithms[J].arXiv:1707.06347,2017.
[16] MNIH V,BADIA A P,MIRZA M,et al.Asynchronous Methods for Deep Reinforcement Learning[C]//International Conference on Machine Learning.PMLR,2016:1928-1937.
[17] HASSELT H.Double Q-learning[J].Advances in Neural Information Processing Systems,2010,23:2613-2621.
[18] RUDER S.An Overview of Gradient Descent Optimization Algorithms[J].arXiv:1609.04747,2016.
[19] KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[J].arXiv:1412.6980,2014.
[20] THRUN S,SCHWARTZ A.Issues in Using Function Approximation for Reinforcement Learning[C]//Proceedings of the Fourth Connectionist Models Summer School.Erlbaum,1993:255-263.
[21] BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540,2016.
[22] TODOROV E,EREZ T,TASSA Y.MuJoCo:A Physics Engine for Model-based Control[C]//Intelligent Robots and Systems.IEEE,2012:5026-5033.