Computer Science ›› 2024, Vol. 51 ›› Issue (2): 252-258. doi: 10.11896/jsjkx.221100019
LI Junwei1, LIU Quan1,2,3,4, XU Yapeng1
Abstract: Temporal abstraction is an important research topic in hierarchical reinforcement learning (HRL). It allows HRL agents to learn policies at different time scales and can effectively address the sparse-reward problem that deep reinforcement learning struggles to handle. Learning good temporal-abstraction policies end-to-end has long been a challenge for HRL research. Built on top of the Option framework, the Option-Critic (OC) framework addresses this challenge effectively through the policy gradient theorem. During policy learning, however, the OC framework suffers from a degeneration problem: the action distributions of the intra-option policies become highly similar. This degeneration hurts the experimental performance of the OC framework and makes the learned options harder to interpret. To solve this problem, mutual information is introduced as an intrinsic reward, and the Option-Critic algorithm with Mutual Information Optimization (MIOOC) is proposed. MIOOC is combined with the Proximal Policy Option-Critic (PPOC) algorithm and guarantees the diversity of the lower-level policies. To verify its effectiveness, MIOOC is compared with several common reinforcement learning methods in continuous-control environments. Experimental results show that MIOOC speeds up model learning, achieves better performance, and yields more distinguishable intra-option policies.
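The paper does not publish code, but the core idea, adding a mutual-information term as an intrinsic reward so that intra-option policies stay distinguishable, can be sketched concretely. Below is a minimal PyTorch sketch of one common way to estimate such a reward: a discriminator q(ω|s') is trained to recognize the currently active option from the state it visits, and log q(ω|s') − log p(ω) (a variational lower bound on the mutual information between options and states) serves as the intrinsic bonus. All names here (OptionDiscriminator, mi_intrinsic_reward, the mixing coefficient beta) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionDiscriminator(nn.Module):
    """Predicts which option is active from the observed state.

    Hypothetical module: the architecture and names are
    illustrative assumptions, not the paper's published code.
    """
    def __init__(self, state_dim: int, num_options: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_options),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns unnormalized logits over the option set.
        return self.net(state)


def mi_intrinsic_reward(disc: OptionDiscriminator,
                        next_state: torch.Tensor,
                        option: torch.Tensor,
                        num_options: int) -> torch.Tensor:
    """Mutual-information intrinsic reward (variational form):
    r_int = log q(option | s') - log p(option), with p uniform.

    An option is rewarded for reaching states from which it can be
    identified, which pushes intra-option policies apart and
    counteracts the degeneration problem described in the abstract.
    """
    with torch.no_grad():
        log_q = F.log_softmax(disc(next_state), dim=-1)          # [B, num_options]
        log_q_w = log_q.gather(-1, option.unsqueeze(-1)).squeeze(-1)  # [B]
    log_p = -torch.log(torch.tensor(float(num_options)))          # uniform prior
    return log_q_w - log_p
```

In training, the shaped reward would be r_total = r_env + beta * r_int, where beta is a hypothetical mixing coefficient, and the discriminator itself is updated with a cross-entropy loss on (next_state, option) pairs collected by the agent.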