Computer Science ›› 2023, Vol. 50 ›› Issue (12): 314-321. doi: 10.11896/jsjkx.221100096

• Artificial Intelligence •

Hierarchical Reinforcement Learning Method Based on Trajectory Information

XU Yapeng1, LIU Quan1,2, LI Junwei1   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2022-11-10  Revised: 2023-03-28  Online: 2023-12-15  Published: 2023-12-07
  • About author: XU Yapeng, born in 1996, postgraduate. His main research interests include hierarchical reinforcement learning and deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Abstract: Option-based hierarchical reinforcement learning (O-HRL) provides temporal abstraction, which allows it to handle problems that are difficult for standard reinforcement learning, such as long-horizon decision making and sparse rewards. Existing O-HRL studies focus mainly on improving data efficiency: raising sampling efficiency and the agent's exploration ability so as to maximize its probability of obtaining useful experience. With respect to policy stability, however, the high-level policy guides low-level actions by considering only the state, so option information is underutilized and the low-level policy becomes unstable. To address this problem, a hierarchical reinforcement learning method based on trajectory information (THRL) is proposed. THRL uses different types of option-trajectory information to guide the selection of low-level actions, and infers options from the extended trajectory information it collects. A discriminator that takes the inferred options and the original options as inputs produces intrinsic rewards, which make low-level action selection more consistent with the current option policy and thereby resolve the instability of the low-level policy. THRL is evaluated in MuJoCo environments against state-of-the-art deep reinforcement learning algorithms, and experimental results show that it achieves better stability and performance.
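To make the mechanism concrete, below is a minimal PyTorch-style sketch of the discriminator-based intrinsic reward the abstract describes. The feature encoding, network sizes, and the log-probability reward form are illustrative assumptions in the spirit of common option-inference designs, not the authors' implementation; all names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionDiscriminator(nn.Module):
    # Infers which option most likely generated a trajectory segment,
    # summarized here as a fixed-size feature vector.
    def __init__(self, traj_dim: int, num_options: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_options),  # logits over the option set
        )

    def forward(self, traj_features: torch.Tensor) -> torch.Tensor:
        return self.net(traj_features)

def intrinsic_reward(disc: OptionDiscriminator,
                     traj_features: torch.Tensor,
                     option: torch.Tensor) -> torch.Tensor:
    # Reward low-level actions that stay consistent with the current option:
    # r_int = log p(option | trajectory) under the discriminator.
    log_probs = F.log_softmax(disc(traj_features), dim=-1)
    return log_probs.gather(-1, option.unsqueeze(-1)).squeeze(-1)

Under this reading, a trajectory segment that the discriminator can confidently attribute to its option earns a high intrinsic reward, so the low-level policy is pushed toward behavior consistent with the option currently selected by the high-level policy.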

Key words: Option, Hierarchical reinforcement learning, Trajectory information, Discriminator, Deep reinforcement learning

CLC Number: TP181