Computer Science, 2024, Vol. 51, Issue 3: 183-197. doi: 10.11896/jsjkx.230400058

• Artificial Intelligence •


Review of Reinforcement Learning and Evolutionary Computation Methods for Strategy Exploration

WANG Yao1,2, LUO Junren1, ZHOU Yanzhong1, GU Xueqiang1, ZHANG Wanpeng1   

  1. College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410000, China
    2. Unit 91286, the Chinese People's Liberation Army, Qingdao, Shandong 266000, China
  • Received: 2023-04-11  Revised: 2023-06-28  Online: 2024-03-15  Published: 2024-03-13
  • Corresponding author: ZHANG Wanpeng (wpzhang@nudt.edu.cn)
  • About author: WANG Yao, born in 1996, postgraduate (wangyao21@nudt.edu.cn). His main research interests include sky satellite system and intelligent evolution. ZHANG Wanpeng, born in 1981, Ph.D., researcher. His main research interests include big data intelligence and intelligent evolution.



Abstract: Reinforcement learning and evolutionary computation, two families of nature-inspired learning paradigms, are the mainstream approaches to strategy exploration problems, and the fusion of the two provides a general solution to such problems. By comparing reinforcement learning with evolutionary computation, this paper analyzes strategy exploration methods from four aspects: the basic methods of reinforcement learning and evolutionary computation, the foundational methods for strategy exploration, the fusion-based methods for strategy exploration, and the frontier challenges, with the aim of inspiring cross-disciplinary research in this field.

Key words: Markov decision process, Reinforcement learning, Evolutionary computation, Strategy exploration, Meta learning

CLC Number: TP391
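The survey positions evolution strategies alongside gradient-based reinforcement learning as the two main routes to strategy (policy) exploration. As a rough illustration only, the minimal sketch below, assuming a caller-supplied episodic_return function that rolls out a flat policy parameter vector in some environment and returns its episodic reward (all names here are hypothetical and not taken from the paper), implements the basic Gaussian-perturbation evolution strategies update for policy search.

import numpy as np

def es_policy_search(episodic_return, dim, iterations=200,
                     population=50, sigma=0.1, lr=0.02, seed=0):
    """Gaussian-perturbation evolution strategies for policy search (sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)                      # flat policy parameters
    for _ in range(iterations):
        # Sample a population of parameter perturbations and evaluate each one.
        noise = rng.standard_normal((population, dim))
        returns = np.array([episodic_return(theta + sigma * eps)
                            for eps in noise])
        # Rank-normalize returns so the update is insensitive to reward scale.
        ranks = returns.argsort().argsort().astype(float)
        weights = (ranks - ranks.mean()) / (ranks.std() + 1e-8)
        # Move the search distribution's mean toward the better perturbations.
        theta += lr / (population * sigma) * noise.T @ weights
    return theta

if __name__ == "__main__":
    # Toy quadratic "return" standing in for an environment rollout.
    target = np.arange(5, dtype=float)
    best = es_policy_search(lambda w: -np.sum((w - target) ** 2), dim=5)
    print(best.round(2))

Rank normalization of the returns, rather than using raw rewards, is the usual way to keep this kind of update robust to reward scaling; replacing the toy quadratic with an actual environment rollout turns the sketch into a black-box policy search loop.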