Computer Science, 2022, Vol. 49 (1): 298-305. doi: 10.11896/jsjkx.201100194

• Artificial Intelligence •

Upper Confidence Bound Exploration with Fast Convergence

AO Tian-yu1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
  • Received: 2020-11-26  Revised: 2021-03-02  Online: 2022-01-15  Published: 2022-01-18
  • About author: AO Tian-yu, born in 1995, postgraduate. His main research interests include deep reinforcement learning. E-mail: 20184227003@stu.suda.edu.cn.
    LIU Quan (corresponding author), born in 1969, Ph.D, professor, is a member of China Computer Federation. His main research interests include deep reinforcement learning and automated reasoning. E-mail: quanliu@suda.edu.cn.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Major Projects of Natural Science Research in Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Applied Basic Research Program, Industrial Part (SYG201422) and the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Deep reinforcement learning (DRL) methods have achieved excellent results in control tasks with large state spaces, and exploration remains a research hotspot in this field. Existing exploration algorithms suffer from problems such as blind exploration and slow learning. To address these problems, an upper confidence bound exploration with fast convergence (FAST-UCB) method is proposed. The method uses the UCB algorithm to explore large state spaces, improving exploration efficiency. To alleviate the overestimation of Q values and balance exploration and exploitation, a Q-value clipping technique is added. Then, to balance the bias and variance of the algorithm and let the agent learn quickly, a long short-term memory (LSTM) unit is added to the network model, and an improved mixed Monte Carlo (MMC) method is used to compute the network error. Finally, FAST-UCB is applied to the deep Q-network (DQN) and compared with the ε-greedy and UCB algorithms in control environments to verify its effectiveness, and with noisy-network exploration, bootstrapped exploration, the asynchronous advantage actor-critic (A3C) algorithm and the proximal policy optimization (PPO) algorithm in the Atari 2600 environment to verify its generalization. Experimental results show that FAST-UCB achieves excellent performance in both types of environment.
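
For orientation only, the three ingredients named in the abstract can be sketched in a few lines of code. This is a hypothetical illustration rather than the authors' implementation: it assumes a count-based UCB bonus in the standard Auer et al. form, a clipped Q estimate taken as the minimum of two value heads, and a mixed Monte Carlo (MMC) target that interpolates the one-step TD target with the Monte Carlo return via a mixing coefficient beta. All function names and parameters are invented for the example; the exact FAST-UCB formulation is given in the full paper.

```python
# Hypothetical sketch of the ideas summarized in the abstract, not the paper's code.
import math
import numpy as np

def ucb_action(q_values, state_count, action_counts, c=2.0):
    """Choose argmax_a [ Q(s,a) + c * sqrt(ln N(s) / N(s,a)) ] (count-based UCB bonus)."""
    bonus = c * np.sqrt(math.log(state_count + 1) / (action_counts + 1))
    return int(np.argmax(q_values + bonus))

def clipped_q(q1, q2):
    """Clipped Q estimate: element-wise minimum of two Q heads, curbing overestimation."""
    return np.minimum(q1, q2)

def mmc_target(reward, next_q_clipped, mc_return, gamma=0.99, beta=0.1):
    """Mixed Monte Carlo target: blend the one-step TD target with the Monte Carlo return."""
    td_target = reward + gamma * np.max(next_q_clipped)
    return (1.0 - beta) * td_target + beta * mc_return

# Tiny usage example with made-up numbers.
q = clipped_q(np.array([0.2, 0.5, 0.1]), np.array([0.3, 0.4, 0.2]))
a = ucb_action(q, state_count=14, action_counts=np.array([3, 10, 1]))
y = mmc_target(reward=1.0, next_q_clipped=q, mc_return=2.5)
```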

Key words: Clipped Q value, Exploration, Long short-term memory, Mixed Monte Carlo, Upper confidence bound

CLC Number: TP181