Computer Science ›› 2021, Vol. 48 ›› Issue (3): 180-187. doi: 10.11896/jsjkx.200700217

• Artificial Intelligence •

Overview of Research on Model-free Reinforcement Learning

QIN Zhi-hui1,2, LI Ning1, LIU Xiao-tong1,3,4,5, LIU Xiu-lei1,2, TONG Qiang1,2, LIU Xu-hong1,2   

  1 Beijing Advanced Innovation Center for Materials Genome Engineering (Beijing Information Science and Technology University), Beijing 100101, China
    2 Laboratory of Data Science and Information Studies, Beijing Information Science and Technology University, Beijing 100101, China
    3 State Key Laboratory of Coal Conversion, Institute of Coal Chemistry, Chinese Academy of Sciences, Taiyuan 030001, China
    4 National Energy Center for Coal to Liquids, Synfuels China Co., Ltd, Beijing 101400, China
    5 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-07-31  Revised: 2020-12-02  Online: 2021-03-15  Published: 2021-03-05
  • About author: QIN Zhi-hui, born in 1996, postgraduate. Her main research interests include reinforcement learning and computational chemistry.
    LIU Xiu-lei, born in 1981, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include semantic sensors, the semantic web, knowledge graphs, semantic information retrieval, and so on.
  • Supported by:
    National Key R&D Program of China (2018YFC0830202), Qin Xin Talents Cultivation Program of Beijing Information Science & Technology University (2020), Beijing Information Science and Technology University Program to Promote the Development of the Connotation of Colleges and Universities (Information+ Project: Key Technology Research for Competitive Analysis of Big Data), Beijing Education Commission General Project of Science and Technology Plan (KM202111232003) and Beijing Natural Science Foundation (4204100).

Abstract: Reinforcement Learning (RL) is a learning paradigm distinct from supervised and unsupervised learning. It focuses on the interaction between an agent and its environment, with the goal of maximizing the accumulated reward. Commonly used RL algorithms are divided into Model-based Reinforcement Learning (MBRL) and Model-free Reinforcement Learning (MFRL). In MBRL, a carefully designed model is used to fit the state-transition dynamics of the environment; in most cases, however, it is difficult to build a sufficiently accurate model from prior knowledge alone. In MFRL, the parameters of the value function or policy are tuned directly through continuous interaction with the environment, and the resulting methods transfer well across tasks. MFRL is therefore widely used in many fields. This paper reviews recent research progress in MFRL. First, an overview of the basic theory is given. Then, three classes of classical MFRL algorithms, based on the value function and the policy function, are introduced. Finally, related research on MFRL is summarized and future directions are discussed.
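To make the value-function branch mentioned above concrete, the sketch below shows tabular Q-learning, one of the classical MFRL algorithms covered by such surveys, on a small toy chain task. It is only an illustrative sketch: the environment, reward scheme, and hyperparameters are assumptions invented for brevity and are not taken from the paper.

# Minimal tabular Q-learning sketch (illustrative assumption, not code from the paper).
# The chain environment, reward scheme and hyperparameters below are invented for brevity.
import random

N_STATES, N_ACTIONS = 5, 2          # tiny chain MDP; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # tabular action-value function

def step(state, action):
    # Toy dynamics: reaching the rightmost state pays reward 1 and restarts the chain.
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    if next_state == N_STATES - 1:
        return 0, 1.0
    return next_state, 0.0

state = 0
for _ in range(5000):
    # Epsilon-greedy behaviour policy: occasional random exploration.
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Model-free update: only the sampled transition is used; no explicit model
    # of the environment's state-transition function is ever built.
    td_target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (td_target - Q[state][action])
    state = next_state

print(Q)   # learned values favour moving right along the chain

Policy-function methods (e.g., policy gradient) would instead parameterize and update the policy directly; the tabular update above is meant only to contrast the model-free setting with MBRL, where a transition model would be learned first.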

Key words: Artificial intelligence, Deep reinforcement learning, Markov decision process, Model-free reinforcement learning, Reinforcement learning

CLC Number: TP181