Computer Science ›› 2022, Vol. 49 ›› Issue (1): 298-305. doi: 10.11896/jsjkx.201100194
AO Tian-yu1, LIU Quan1,2,3,4
Abstract: Deep reinforcement learning (DRL) methods have achieved excellent results on control tasks with large state spaces, and exploration remains a research hotspot in this field. Existing exploration algorithms suffer from blind exploration and slow learning. To address these problems, this paper proposes an upper confidence bound exploration method with fast convergence (FAST-UCB). The method uses the UCB algorithm to explore large state spaces, improving exploration efficiency. To alleviate Q-value overestimation and balance exploration against exploitation, a Q-value clipping technique is introduced. Then, to balance the algorithm's bias and variance and enable the agent to learn quickly, long short-term memory (LSTM) units are added to the network model, and an improved mixed Monte Carlo (MMC) method is used to compute the network error. Finally, FAST-UCB is applied to the deep Q-network (DQN) and compared with ε-greedy and UCB exploration in control environments to verify its effectiveness, and compared with Noisy-Network exploration, Bootstrapped exploration, the asynchronous advantage actor-critic (A3C) algorithm, and the proximal policy optimization (PPO) algorithm in the Atari 2600 environment to verify its generalization ability. Experimental results show that FAST-UCB achieves excellent performance in both kinds of environments.
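To make the components named in the abstract concrete, the following is a minimal tabular sketch, not the paper's actual implementation: a UCB action-selection rule with a count-based exploration bonus, and a mixed Monte Carlo (MMC) target that blends a one-step TD target with a Monte Carlo return and clips the bootstrapped Q-value to curb overestimation. The function names, the hyperparameters `c`, `beta`, `gamma`, and the clipping threshold are all illustrative assumptions, not values from the paper.

```python
import math

def ucb_action(q, counts, state, n_actions, c=2.0):
    """UCB selection: favor actions with high estimated value or low visit count.

    q and counts are dicts keyed by (state, action); c is an assumed
    exploration coefficient, not a value from the paper.
    """
    total = sum(counts[(state, a)] for a in range(n_actions)) + 1
    scores = []
    for a in range(n_actions):
        n = counts[(state, a)]
        # Unvisited actions get an infinite bonus so each action is tried once.
        bonus = c * math.sqrt(math.log(total) / n) if n > 0 else float("inf")
        scores.append(q[(state, a)] + bonus)
    return max(range(n_actions), key=lambda a: scores[a])

def mmc_target(reward, next_max_q, mc_return, gamma=0.99, beta=0.1, clip=100.0):
    """Mixed Monte Carlo target: blend one-step TD and MC return.

    Clipping the bootstrapped next_max_q is a stand-in for the paper's
    Q-value clipping trick; beta trades bias (TD) against variance (MC).
    """
    td = reward + gamma * min(next_max_q, clip)
    return (1 - beta) * td + beta * mc_return
```

In the paper's setting the tabular `q` would be replaced by an LSTM-augmented Q-network, with the MMC target used as the regression target in the network loss.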