计算机科学 ›› 2023, Vol. 50 ›› Issue (5): 201-216. doi: 10.11896/jsjkx.220400235
张启阳, 陈希亮, 曹雷, 赖俊, 盛蕾
ZHANG Qiyang, CHEN Xiliang, CAO Lei, LAI Jun, SHENG Lei
Abstract: Deep reinforcement learning is a hot topic in artificial intelligence research. As research has deepened, its shortcomings have gradually been exposed, such as low data efficiency, weak generalization, difficulty in exploration, and a lack of reasoning and representation capabilities. These problems severely limit the application of deep reinforcement learning to real-world problems. Knowledge transfer is a highly effective way to address them. From the perspective of deep reinforcement learning, this paper discusses how knowledge transfer can be used to accelerate agent training and cross-domain transfer, analyzes the forms in which knowledge exists and the ways it acts in deep reinforcement learning, and classifies and summarizes knowledge transfer methods according to the basic components of reinforcement learning. Finally, it summarizes the open problems and future directions of knowledge transfer in deep reinforcement learning with respect to algorithms, theory, and applications.