Computer Science ›› 2021, Vol. 48 ›› Issue (3): 180-187. doi: 10.11896/jsjkx.200700217

• Artificial Intelligence •

Overview of Research on Model-free Reinforcement Learning

QIN Zhi-hui1,2, LI Ning1, LIU Xiao-tong1,3,4,5, LIU Xiu-lei1,2, TONG Qiang1,2, LIU Xu-hong1,2   

  1 Beijing Advanced Innovation Center for Materials Genome Engineering (Beijing Information Science and Technology University), Beijing 100101, China
    2 Laboratory of Data Science and Information Studies, Beijing Information Science and Technology University, Beijing 100101, China
    3 State Key Laboratory of Coal Conversion, Institute of Coal Chemistry, Chinese Academy of Sciences, Taiyuan 030001, China
    4 National Energy Center for Coal to Liquids, Synfuels China Co., Ltd, Beijing 101400, China
    5 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-07-31  Revised: 2020-12-02  Online: 2021-03-15  Published: 2021-03-05
  • About author: QIN Zhi-hui, born in 1996, postgraduate. Her main research interests include reinforcement learning and computational chemistry.
    LIU Xiu-lei, born in 1981, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include semantic sensors, the semantic web, knowledge graphs, semantic information retrieval, and so on.
  • Supported by:
    National Key R&D Program of China (2018YFC0830202), Qin Xin Talents Cultivation Program of Beijing Information Science & Technology University (2020), Beijing Information Science and Technology University Program to Promote the Development of the Connotation of Colleges and Universities (Information+ Project: Key Technology Research for Competitive Analysis of Big Data), Beijing Education Commission General Project of Science and Technology Plan (KM202111232003) and Beijing Natural Science Foundation (4204100).

Abstract: Reinforcement Learning (RL) is a learning paradigm distinct from supervised and unsupervised learning. It focuses on the interaction between an agent and its environment, with the goal of maximizing the accumulated reward. Commonly used RL algorithms are divided into Model-based Reinforcement Learning (MBRL) and Model-free Reinforcement Learning (MFRL). In MBRL, a carefully designed model is used to fit the state-transition dynamics of the environment; in most cases, however, it is difficult to build a sufficiently accurate model from prior knowledge alone. In MFRL, the parameters of the value function or policy are tuned directly through continuous interaction with the environment, and the resulting methods transfer well across tasks. MFRL is therefore widely used in many fields. This paper reviews recent research progress in MFRL. First, an overview of the basic theory is given. Then, three classes of classical MFRL algorithms, based on the value function and the policy function, are introduced. Finally, related research on MFRL is summarized and future directions are discussed.
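To make the value-function branch mentioned above concrete, the sketch below shows tabular Q-learning, one of the classical MFRL algorithms covered by such surveys, on a small toy chain task. It is only an illustrative sketch: the environment, reward scheme, and hyperparameters are assumptions invented for brevity and are not taken from the paper.

# Minimal tabular Q-learning sketch (illustrative assumption, not code from the paper).
# The chain environment, reward scheme and hyperparameters below are invented for brevity.
import random

N_STATES, N_ACTIONS = 5, 2          # tiny chain MDP; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # tabular action-value function

def step(state, action):
    # Toy dynamics: reaching the rightmost state pays reward 1 and restarts the chain.
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    if next_state == N_STATES - 1:
        return 0, 1.0
    return next_state, 0.0

state = 0
for _ in range(5000):
    # Epsilon-greedy behaviour policy: occasional random exploration.
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Model-free update: only the sampled transition is used; no explicit model
    # of the environment's state-transition function is ever built.
    td_target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (td_target - Q[state][action])
    state = next_state

print(Q)   # learned values favour moving right along the chain

Policy-function methods (e.g., policy gradient) would instead parameterize and update the policy directly; the tabular update above is meant only to contrast the model-free setting with MBRL, where a transition model would be learned first.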

Key words: Artificial intelligence, Deep reinforcement learning, Markov decision process, Model-free reinforcement learning, Reinforcement learning

CLC Number: TP181