计算机科学 ›› 2021, Vol. 48 ›› Issue (3): 180-187. doi: 10.11896/jsjkx.200700217
秦智慧1,2, 李宁1, 刘晓彤1,3,4,5, 刘秀磊1,2, 佟强1,2, 刘旭红1,2
QIN Zhi-hui1,2, LI Ning1, LIU Xiao-tong1,3,4,5, LIU Xiu-lei1,2, TONG Qiang1,2, LIU Xu-hong1,2
Abstract: Reinforcement learning (RL) is, alongside supervised learning and unsupervised learning, the third learning paradigm in machine learning: an agent learns by interacting with its environment so as to maximize cumulative reward. Commonly used RL algorithms fall into two categories: model-based reinforcement learning and model-free reinforcement learning. Model-based RL predefines a dynamics model of the environment from state-transition data of the real environment; policy learning then proceeds on this model and requires no further interaction with the environment. In model-free RL, the agent learns the optimal policy through real-time interaction with the environment; because this approach generalizes better across practical tasks, it enjoys a wider range of applications. This paper surveys the latest research progress and trends in model-free RL. It first introduces the basic theory of RL, model-based RL, and model-free RL; it then summarizes the classic model-free algorithms, organized by value function and by policy function, together with their respective strengths and weaknesses; finally, it reviews the current state of model-free RL research in game AI, chemical-materials design, natural language processing, and robot control, and offers an outlook on future directions.
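The value-based model-free approach described above can be illustrated with tabular Q-learning, the classic algorithm of that family: the agent never models the environment's dynamics, but improves its action-value estimates purely from sampled interaction. The sketch below uses a hypothetical 5-state "chain" environment; the environment, hyperparameters, and all names are illustrative assumptions, not taken from the paper.

```python
import random

# A minimal sketch of tabular Q-learning, the classic value-based model-free
# algorithm. The 5-state "chain" environment and all hyperparameters below are
# hypothetical, chosen only to keep the example self-contained.

N_STATES = 5          # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def step(state, action):
    """Environment dynamics: hidden from the agent, queried only by interaction."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def train(episodes=500, seed=0):
    random.seed(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy exploration; ties between equal Q-values broken randomly
            if random.random() < EPS:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: (q[(state, a)], random.random()))
            nxt, reward, done = step(state, action)
            # Q-learning update: bootstrap from the greedy value of the next state
            target = reward + GAMMA * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = nxt
    return q

q = train()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # the learned greedy policy should always move right toward the goal
```

Note that `step` is called only to sample transitions, never to plan: this is exactly the model-free setting the survey contrasts with model-based RL, where a learned or predefined dynamics model replaces such interaction.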