Computer Science (计算机科学), 2021, Vol. 48, Issue 3: 180-187. doi: 10.11896/jsjkx.200700217

• Artificial Intelligence •


Overview of Research on Model-free Reinforcement Learning

QIN Zhi-hui1,2, LI Ning1, LIU Xiao-tong1,3,4,5, LIU Xiu-lei1,2, TONG Qiang1,2, LIU Xu-hong1,2   

  1 Beijing Advanced Innovation Center for Materials Genome Engineering (Beijing Information Science and Technology University), Beijing 100101, China
    2 Laboratory of Data Science and Information Studies, Beijing Information Science and Technology University, Beijing 100101, China
    3 State Key Laboratory of Coal Conversion, Institute of Coal Chemistry, Chinese Academy of Sciences, Taiyuan 030001, China
    4 National Energy Center for Coal to Liquids, Synfuels China Co., Ltd., Beijing 101400, China
    5 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-07-31  Revised: 2020-12-02  Online: 2021-03-15  Published: 2021-03-05
  • Corresponding author: LIU Xiu-lei (xiuleiliu@hotmail.com)
  • About the authors: QIN Zhi-hui (qzh@bistu.edu.cn), born in 1996, postgraduate. Her main research interests include reinforcement learning and computational chemistry.
    LIU Xiu-lei, born in 1981, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include semantic sensors, the semantic web, knowledge graphs, and semantic information retrieval.
  • Supported by:
    National Key R&D Program of China (2018YFC0830202), Qin Xin Talents Cultivation Program of Beijing Information Science & Technology University (2020), Beijing Information Science & Technology University "Connotative Development of Universities: Information+" Project (Key Technology Research on Competitive Intelligence Analysis for Big Data), General Program of the Science and Technology Plan of the Beijing Municipal Education Commission (KM202111232003), and Beijing Natural Science Foundation (4204100).

Abstract: Reinforcement Learning (RL) is, alongside supervised learning and unsupervised learning, the third major learning paradigm in machine learning: an agent learns by interacting with its environment so as to maximize the cumulative reward. Commonly used RL algorithms are divided into model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL). MBRL builds a model of the environment dynamics from state-transition data of the real environment and then learns a policy from that model without further interaction with the environment; in most cases, however, it is difficult to build a sufficiently accurate model from prior knowledge. In MFRL, the agent learns an optimal policy directly through real-time interaction with the environment; this approach transfers more readily across practical tasks and therefore has a wider range of applications. This paper surveys recent progress and trends in MFRL. It first introduces the basic theory of RL, MBRL and MFRL; it then summarizes the classical MFRL algorithms based on value functions and policy functions, together with their respective advantages and disadvantages; finally, it reviews the state of the art of MFRL in game AI, chemical and materials design, natural language processing and robot control, and discusses future research directions for MFRL.
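As a concrete illustration of the value-function-based family surveyed above, the following is a minimal sketch of tabular Q-learning, a classical model-free control algorithm, on a toy corridor task. The ChainEnv class, its size, and all hyperparameters (episodes, alpha, gamma, epsilon) are hypothetical choices made for this sketch and are not taken from the paper. A policy-gradient counterpart follows the keyword list below.

```python
import random

# Toy deterministic corridor: states 0..n-1, actions 0 (left) and 1 (right).
# Reaching the right end yields reward 1 and ends the episode.
class ChainEnv:
    def __init__(self, n_states=6):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = (min(self.state + 1, self.n_states - 1) if action == 1
                      else max(self.state - 1, 0))
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done


def greedy(q_row):
    # Greedy action with random tie-breaking (important while Q is still all zero).
    best = max(q_row)
    return random.choice([a for a, v in enumerate(q_row) if v == best])


def q_learning(env, episodes=500, max_steps=200, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Model-free control: the agent never estimates transition probabilities;
    # it updates Q(s, a) only from sampled (s, a, r, s') interactions.
    q = [[0.0, 0.0] for _ in range(env.n_states)]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = random.randrange(2) if random.random() < epsilon else greedy(q[s])
            s_next, r, done = env.step(a)
            # Off-policy temporal-difference update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
            if done:
                break
    return q


if __name__ == "__main__":
    q = q_learning(ChainEnv())
    # After training, the greedy policy should prefer action 1 (move right)
    # in every non-terminal state.
    print([greedy(row) for row in q[:-1]])
```

The agent needs no knowledge of the corridor's transition rules; the value estimates are driven entirely by the samples it collects, which is exactly the property that distinguishes the model-free family discussed in the abstract.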

Key words: Artificial intelligence, Deep reinforcement learning, Markov decision process, Model-free reinforcement learning, Reinforcement learning
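The abstract also covers the policy-function-based family. As a counterpart to the value-based sketch above, here is a minimal Monte Carlo policy-gradient (REINFORCE-style) sketch with a tabular softmax policy. The corridor task, its length, the learning rate and the episode budget are again hypothetical choices for illustration, not details from the paper.

```python
import math
import random

# Toy episodic corridor of L states; action 1 moves right, action 0 moves left.
# Reaching the right end gives +1; every other step costs -0.01.
L = 4

def step(state, action):
    state = min(state + 1, L - 1) if action == 1 else max(state - 1, 0)
    done = state == L - 1
    return state, (1.0 if done else -0.01), done

# Tabular softmax policy: one parameter per (state, action).
theta = [[0.0, 0.0] for _ in range(L)]

def policy(state):
    z = [math.exp(t) for t in theta[state]]
    total = sum(z)
    return [p / total for p in z]

def run_episode(max_steps=50):
    trajectory, state = [], 0
    for _ in range(max_steps):
        probs = policy(state)
        action = 0 if random.random() < probs[0] else 1
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

def reinforce(episodes=2000, lr=0.1, gamma=0.99):
    for _ in range(episodes):
        traj = run_episode()
        g = 0.0
        # Walk the trajectory backwards to accumulate discounted returns G_t.
        for state, action, reward in reversed(traj):
            g = reward + gamma * g
            probs = policy(state)
            for b in (0, 1):
                # Gradient of log softmax: 1{b == action} - pi(b | state).
                grad_log = (1.0 if b == action else 0.0) - probs[b]
                theta[state][b] += lr * g * grad_log

if __name__ == "__main__":
    reinforce()
    # Probability of moving right in each non-terminal state should approach 1.
    print([round(policy(s)[1], 2) for s in range(L - 1)])
```

Unlike Q-learning, this update adjusts the policy parameters directly in the direction of the return-weighted log-probability gradient instead of bootstrapping a value estimate, which is the basic distinction between the two classical algorithm families the survey organizes.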

CLC Number: TP181
Related articles:
[1] 熊丽琴, 曹雷, 赖俊, 陈希亮. Overview of Multi-agent Deep Reinforcement Learning Based on Value Factorization. Computer Science, 2022, 49(9): 172-182. https://doi.org/10.11896/jsjkx.210800112
[2] 刘兴光, 周力, 刘琰, 张晓瀛, 谭翔, 魏急波. Construction and Distribution Method of REM Based on Edge Intelligence. Computer Science, 2022, 49(9): 236-241. https://doi.org/10.11896/jsjkx.220400148
[3] 袁唯淋, 罗俊仁, 陆丽娜, 陈佳星, 张万鹏, 陈璟. Methods in Adversarial Intelligent Game: A Holistic Comparative Analysis from Perspective of Game Theory and Reinforcement Learning. Computer Science, 2022, 49(8): 191-204. https://doi.org/10.11896/jsjkx.220200174
[4] 史殿习, 赵琛然, 张耀文, 杨绍武, 张拥军. Adaptive Reward Method for End-to-End Cooperation Based on Multi-agent Reinforcement Learning. Computer Science, 2022, 49(8): 247-256. https://doi.org/10.11896/jsjkx.210700100
[5] 于滨, 李学华, 潘春雨, 李娜. Edge-Cloud Collaborative Resource Allocation Algorithm Based on Deep Reinforcement Learning. Computer Science, 2022, 49(7): 248-253. https://doi.org/10.11896/jsjkx.210400219
[6] 李梦菲, 毛莺池, 屠子健, 王瑄, 徐淑芳. Server-reliability Task Offloading Strategy Based on Deep Deterministic Policy Gradient. Computer Science, 2022, 49(7): 271-279. https://doi.org/10.11896/jsjkx.210600040
[7] 郭雨欣, 陈秀宏. Automatic Summarization Model Combining BERT Word Embedding Representation and Topic Information Enhancement. Computer Science, 2022, 49(6): 313-318. https://doi.org/10.11896/jsjkx.210400101
[8] 范静宇, 刘全. Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning. Computer Science, 2022, 49(6): 335-341. https://doi.org/10.11896/jsjkx.210300081
[9] 谢万城, 李斌, 代玥玥. PPO Based Task Offloading Scheme in Aerial Reconfigurable Intelligent Surface-assisted Edge Computing. Computer Science, 2022, 49(6): 3-11. https://doi.org/10.11896/jsjkx.220100249
[10] 洪志理, 赖俊, 曹雷, 陈希亮, 徐志雄. Study on Intelligent Recommendation Method of Dueling Network Reinforcement Learning Based on Regret Exploration. Computer Science, 2022, 49(6): 149-157. https://doi.org/10.11896/jsjkx.210600226
[11] 张佳能, 李辉, 吴昊霖, 王壮. Exploration and Exploitation Balanced Experience Replay. Computer Science, 2022, 49(5): 179-185. https://doi.org/10.11896/jsjkx.210300084
[12] 李野, 陈松灿. Physics-informed Neural Networks: Recent Advances and Prospects. Computer Science, 2022, 49(4): 254-262. https://doi.org/10.11896/jsjkx.210500158
[13] 李鹏, 易修文, 齐德康, 段哲文, 李天瑞. Heating Strategy Optimization Method Based on Deep Learning. Computer Science, 2022, 49(4): 263-268. https://doi.org/10.11896/jsjkx.210300155
[14] 丛颖男, 王兆毓, 朱金清. Insights into Dataset and Algorithm Related Problems in Artificial Intelligence for Law. Computer Science, 2022, 49(4): 74-79. https://doi.org/10.11896/jsjkx.210900191
[15] 周琴, 罗飞, 丁炜超, 顾春华, 郑帅. Double Speedy Q-Learning Based on Successive Over Relaxation. Computer Science, 2022, 49(3): 239-245. https://doi.org/10.11896/jsjkx.201200173