计算机科学 ›› 2022, Vol. 49 ›› Issue (6): 335-341.doi: 10.11896/jsjkx.210300081
范静宇1, 刘全1,2,3,4
FAN Jing-yu1, LIU Quan1,2,3,4
摘要: 强化学习是机器学习中一个重要的分支,随着深度学习的发展,深度强化学习逐渐发展为强化学习研究的重点。因应用广泛且实用性较强,面向连续控制问题的无模型异策略深度强化学习算法备受关注。同基于离散动作的Q学习一样,类行动者-评论家算法会受到动作值高估问题的影响。在类行动者-评论家算法的学习过程中,剪切双Q学习可以在一定程度上解决动作值高估的问题,但同时也引入了一定程度的低估问题。为了进一步解决类行动者-评论家算法中的高低估问题,提出了一种新的随机加权三重Q学习方法。该方法可以更好地解决类行动者-评论家算法中的高低估问题。此外,将这种新的方法与软行动者-评论家算法结合,提出了一种新的基于随机加权三重Q学习的软行动者-评论家算法,该算法在限制Q估计值在真实Q值附近的同时,通过随机加权方法增加Q估计值的随机性,从而有效解决了学习过程中对动作值的高低估问题。实验结果表明,相比SAC算法、DDPG算法、PPO算法与TD3算法等深度强化学习算法,SAC-RWTQ算法可以在gym仿真平台中的多个Mujoco任务上获得更好的表现。
中图分类号:
[1] SUTTON R S,BARTO A G.Reinforcement Learning:An In-troduction[M].Massachusetts:MIT Press,2018. [2] HUA J,ZENG L,LI G,et al.Learning for a Robot:Deep Reinforcement Learning,Imitation Learning,Transfer Learning[J].Sensors,2021,21(4):1278. [3] SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the Game of Go without Human Knowledge[J].Nature,2017,550(7676):354-359. [4] ARTHUR L,SAMUE L.Some Studies in Machine LearningUsing the Game of Checkers[J].IBM Journal of Research and Development,2000,44(1/2):206-226. [5] CHEN J P,ZOU F,LIU Q,et al.A Reinforcement Learning Algorithm Based on Generative Adversarial Networks[J].Theoretical Computer Science,2019,46(10):265-272. [6] WATKINS C,DAYAN P.Technical Note Q-Learning[J].Machine Learning,1992,8:279-292. [7] SUTTON R S.Learning to Predict by the Method of Temporal Differences[J].Machine Learning,1988,3(1):9-44. [8] GOODFELLOW I,BENGIO Y,COURVILLE A.Deep Learning[M].Massachusetts:MIT Press,2016. [9] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Playing Atari with Deep Reinforcement Learning[J].arXiv:1312.5602,2013. [10] HASSELT H V,GUEZ A,SILVER D.Deep ReinforcementLearning with Double Q-Learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.AAAI Press,2016:2094-2100. [11] LILLICRA T P,HUNT J J,PRITZEL A,et al.ContinuousControl with Deep Reinforcement Learning[C]//Proceedings of the 4th International Conference on Learning Representations.ICLR,2016. [12] FUJIMOTO S,HOOF H V,MEGER D.Addressing Function Approximation Error in Actor-Critic Methods[C]//International Conference on Machine Learning.PMLR,2018:1587-1596. [13] HAARNOJA T,ZHOU A,ABBEEL P,et al.SOFT ACTOR-CRITIC:Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]//Proceedings of the 35th International Conference on Machine Learning.PMLR,2018:1856-1865. [14] SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust Region Policy Optimization[C]//International Conference on Machine Learning.PMLR,2015:1889-1897. [15] SCHULMAN J,WOLSKI F,DHARIWAL P,et al.ProximalPolicy Optimization Algorithms[J].arXiv:/1707.06347,2017. [16] MNIH V,BADIA A P,MIRZA M,et al.Asynchronous Me-thods for Deep Reinforcement Learning[C]//International Conference on Machine Learning.PMLR,2016:1928-1937. [17] HASSELT H.Double Q-learning[J].Advances in Neural Information Processing Systems,2010,23:2613-2621. [18] RUDER R.An Overview of Gradient Descent Optimization Algorithms[J].arXiv:1609.04747,2016. [19] KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[J].arXiv:1412.6980,2014. [20] THRUN S,SCHWARTZ A.Issues in using Function Approximation for Reinforcement Learning[C]//Proceedings of the Fourth Connectionist Models Summer School.Erlbaum,1993:255-263. [21] BROCKMAN G,CHEUNG V,PETTERSSON L,et al.Openai Gym[J].arXiv:1606.01540,2016. [22] TODOROV E,EREZ T,TASSA Y.MuJoCo:A Physics Engine for Model-based Control[C]//Intelligent Robots and Systems.IEEE,2012:5026-5033. |
[1] | 徐涌鑫, 赵俊峰, 王亚沙, 谢冰, 杨恺. 时序知识图谱表示学习 Temporal Knowledge Graph Representation Learning 计算机科学, 2022, 49(9): 162-171. https://doi.org/10.11896/jsjkx.220500204 |
[2] | 饶志双, 贾真, 张凡, 李天瑞. 基于Key-Value关联记忆网络的知识图谱问答方法 Key-Value Relational Memory Networks for Question Answering over Knowledge Graph 计算机科学, 2022, 49(9): 202-207. https://doi.org/10.11896/jsjkx.220300277 |
[3] | 汤凌韬, 王迪, 张鲁飞, 刘盛云. 基于安全多方计算和差分隐私的联邦学习方案 Federated Learning Scheme Based on Secure Multi-party Computation and Differential Privacy 计算机科学, 2022, 49(9): 297-305. https://doi.org/10.11896/jsjkx.210800108 |
[4] | 王剑, 彭雨琦, 赵宇斐, 杨健. 基于深度学习的社交网络舆情信息抽取方法综述 Survey of Social Network Public Opinion Information Extraction Based on Deep Learning 计算机科学, 2022, 49(8): 279-293. https://doi.org/10.11896/jsjkx.220300099 |
[5] | 郝志荣, 陈龙, 黄嘉成. 面向文本分类的类别区分式通用对抗攻击方法 Class Discriminative Universal Adversarial Attack for Text Classification 计算机科学, 2022, 49(8): 323-329. https://doi.org/10.11896/jsjkx.220200077 |
[6] | 姜梦函, 李邵梅, 郑洪浩, 张建朋. 基于改进位置编码的谣言检测模型 Rumor Detection Model Based on Improved Position Embedding 计算机科学, 2022, 49(8): 330-335. https://doi.org/10.11896/jsjkx.210600046 |
[7] | 孙奇, 吉根林, 张杰. 基于非局部注意力生成对抗网络的视频异常事件检测方法 Non-local Attention Based Generative Adversarial Network for Video Abnormal Event Detection 计算机科学, 2022, 49(8): 172-177. https://doi.org/10.11896/jsjkx.210600061 |
[8] | 侯钰涛, 阿布都克力木·阿布力孜, 哈里旦木·阿布都克里木. 中文预训练模型研究进展 Advances in Chinese Pre-training Models 计算机科学, 2022, 49(7): 148-163. https://doi.org/10.11896/jsjkx.211200018 |
[9] | 周慧, 施皓晨, 屠要峰, 黄圣君. 基于主动采样的深度鲁棒神经网络学习 Robust Deep Neural Network Learning Based on Active Sampling 计算机科学, 2022, 49(7): 164-169. https://doi.org/10.11896/jsjkx.210600044 |
[10] | 苏丹宁, 曹桂涛, 王燕楠, 王宏, 任赫. 小样本雷达辐射源识别的深度学习方法综述 Survey of Deep Learning for Radar Emitter Identification Based on Small Sample 计算机科学, 2022, 49(7): 226-235. https://doi.org/10.11896/jsjkx.210600138 |
[11] | 胡艳羽, 赵龙, 董祥军. 一种用于癌症分类的两阶段深度特征选择提取算法 Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification 计算机科学, 2022, 49(7): 73-78. https://doi.org/10.11896/jsjkx.210500092 |
[12] | 程成, 降爱莲. 基于多路径特征提取的实时语义分割方法 Real-time Semantic Segmentation Method Based on Multi-path Feature Extraction 计算机科学, 2022, 49(7): 120-126. https://doi.org/10.11896/jsjkx.210500157 |
[13] | 王君锋, 刘凡, 杨赛, 吕坦悦, 陈峙宇, 许峰. 基于多源迁移学习的大坝裂缝检测 Dam Crack Detection Based on Multi-source Transfer Learning 计算机科学, 2022, 49(6A): 319-324. https://doi.org/10.11896/jsjkx.210500124 |
[14] | 楚玉春, 龚航, 王学芳, 刘培顺. 基于YOLOv4的目标检测知识蒸馏算法研究 Study on Knowledge Distillation of Target Detection Algorithm Based on YOLOv4 计算机科学, 2022, 49(6A): 337-344. https://doi.org/10.11896/jsjkx.210600204 |
[15] | 周志豪, 陈磊, 伍翔, 丘东亮, 梁广升, 曾凡巧. 基于SMOTE-SDSAE-SVM的车载CAN总线入侵检测算法 SMOTE-SDSAE-SVM Based Vehicle CAN Bus Intrusion Detection Algorithm 计算机科学, 2022, 49(6A): 562-570. https://doi.org/10.11896/jsjkx.210700106 |
|