Computer Science ›› 2024, Vol. 51 ›› Issue (2): 252-258. doi: 10.11896/jsjkx.221100019

• Artificial Intelligence •

Option-Critic Algorithm Based on Mutual Information Optimization

LI Junwei1, LIU Quan1,2,3,4, XU Yapeng1

  1. 1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2022-11-03 Revised: 2023-03-15 Online: 2024-02-15 Published: 2024-02-22
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author: (20205227020@stu.suda.edu.cn) LI Junwei, born in 1998, postgraduate. His main research interests include reinforcement learning and hierarchical reinforcement learning. LIU Quan, born in 1969, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.15231S). His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055), Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program Part (SYG201422), and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Temporal abstraction, a central topic in hierarchical reinforcement learning, lets an agent learn policies at different time scales and can effectively address the sparse-reward problem that deep reinforcement learning struggles with. Learning good temporally abstract policies end to end has long been a challenge for hierarchical reinforcement learning. Building on the Option framework, the Option-Critic (OC) framework addresses this problem through policy gradient theory. During policy learning, however, the OC framework suffers from a degradation problem in which the action distributions of the intra-option policies become very similar. This degradation hurts the experimental performance of the OC framework and makes the Options less interpretable. To address these problems, mutual information is introduced as an intrinsic reward, and an Option-Critic algorithm with mutual information optimization (MIOOC) is proposed. MIOOC is combined with the proximal policy Option-Critic (PPOC) algorithm and ensures the diversity of the lower-level policies. To verify its effectiveness, MIOOC is compared with several common reinforcement learning methods in continuous experimental environments. Experimental results show that MIOOC speeds up model learning, achieves better experimental performance, and yields more distinguishable intra-option policies.

Key words: Deep reinforcement learning, Temporal abstraction, Hierarchical reinforcement learning, Mutual information, Intrinsic rewards, Diversity in option policies
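
The abstract does not state the exact form of the mutual-information reward, so the following is only a minimal sketch under assumptions, not the authors' method. It assumes a DIAYN-style variational bound, in which a discriminator estimates q(option | state) and the intrinsic reward is log q(option | state) minus log p(option) under a uniform prior over options. The names `intrinsic_reward`, `shaped_reward`, and the weight `beta` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intrinsic_reward(discriminator_logits, option, num_options):
    """Assumed DIAYN-style bonus: log q(option | state) - log p(option).

    discriminator_logits: unnormalized scores, one per option, produced by
    some learned discriminator for the current state (assumption).
    The prior p(option) is assumed uniform over num_options.
    """
    q = softmax(discriminator_logits)
    return math.log(q[option]) - math.log(1.0 / num_options)


def shaped_reward(env_reward, discriminator_logits, option, num_options, beta=0.1):
    """Environment reward plus a weighted intrinsic bonus (beta is hypothetical)."""
    bonus = intrinsic_reward(discriminator_logits, option, num_options)
    return env_reward + beta * bonus
```

Under this assumed form, the bonus is zero when the discriminator cannot tell the options apart, positive when the active option is easy to identify from the state, and never larger than log(num_options). This is one way a reward term could push intra-option policies apart.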

CLC Number:

  • TP181