Computer Science ›› 2024, Vol. 51 ›› Issue (2): 252-258.doi: 10.11896/jsjkx.221100019

• Artificial Intelligence •

Option-Critic Algorithm Based on Mutual Information Optimization

LI Junwei1, LIU Quan1,2,3,4, XU Yapeng1   

  1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received: 2022-11-03 Revised: 2023-03-15 Online: 2024-02-15 Published: 2024-02-22
  • About author: LI Junwei, born in 1998, postgraduate. His main research interests include reinforcement learning and hierarchical reinforcement learning. LIU Quan, born in 1969, Ph.D, professor, Ph.D supervisor, is a member of CCF(No.15231S). His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055),Jiangsu Province Natural Science Research University Major Projects(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Industrial Application of Basic Research Program Part(SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Temporal abstraction is an important research topic in hierarchical reinforcement learning: it allows hierarchical agents to learn policies at different time scales, which can effectively alleviate the sparse-reward problem that deep reinforcement learning struggles to handle. Learning good temporal-abstraction policies end-to-end remains a long-standing challenge in hierarchical reinforcement learning. Built on the Option framework, the Option-Critic (OC) architecture addresses this problem via policy gradient theory. During policy learning, however, the OC framework suffers from a degradation problem in which the action distributions of the intra-option policies become very similar. This degradation hurts the experimental performance of the OC framework and makes the learned options hard to interpret. To address this problem, a mutual-information term is introduced as an intrinsic reward, and an Option-Critic algorithm with mutual information optimization (MIOOC) is proposed. MIOOC is combined with the proximal policy Option-Critic (PPOC) algorithm to ensure the diversity of the lower-level policies. To verify the effectiveness of the algorithm, MIOOC is compared with several common reinforcement learning methods in continuous-control environments. Experimental results show that MIOOC speeds up model learning, improves experimental performance, and learns more distinguishable intra-option policies.
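The paper's exact mutual-information objective is not reproduced on this page, so the following is only a minimal sketch of the general idea, assuming a DIAYN-style intrinsic reward r_int = log q(o|s) − log p(o) that rewards states from which a discriminator can identify the active option. All names here (OptionDiscriminator, mi_intrinsic_reward, mi_coef) are hypothetical illustrations, not the paper's API.

```python
# Illustrative sketch only: a mutual-information bonus used as an intrinsic
# reward inside an option-critic style update. Hypothetical names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionDiscriminator(nn.Module):
    """Estimates q(option | state); its log-probability supplies the MI bonus."""
    def __init__(self, state_dim: int, num_options: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_options),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over options

def mi_intrinsic_reward(disc: OptionDiscriminator,
                        state: torch.Tensor,    # [batch, state_dim]
                        option: torch.Tensor,   # [batch], long option indices
                        num_options: int) -> torch.Tensor:
    """r_int = log q(o|s) - log p(o), with p(o) assumed uniform.

    The bonus is high when the visited state identifies the active option,
    which pushes the intra-option policies toward distinct behaviors.
    """
    log_q = F.log_softmax(disc(state), dim=-1)
    log_q_o = log_q.gather(-1, option.unsqueeze(-1)).squeeze(-1)
    log_p_o = -torch.log(torch.tensor(float(num_options)))
    return log_q_o - log_p_o

# Reward shaping inside a rollout (mi_coef is a hypothetical weight):
#   total_reward = env_reward + mi_coef * mi_intrinsic_reward(disc, s, o, K)
# The discriminator itself is trained with cross-entropy on (state, option)
# pairs collected during rollouts, and the shaped return then feeds a
# PPO-style option-critic update.
```

Shaping the environment reward with a bonus of this form penalizes options whose state visitation overlaps, which is one way to counteract the degradation problem described above.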

Key words: Deep reinforcement learning, Temporal abstraction, Hierarchical reinforcement learning, Mutual information, Intrinsic rewards, Diversity in option policies

CLC Number: TP181
[1]SUTTON R S,BARTO A G.Reinforcement learning:An introduction[M].MIT Press,1998.
[2]LIU Q,ZHAI J W,ZHANG Z Z,et al.A survey on deep reinforcement learning[J].Chinese Journal of Computers,2018,41(1):1-27.
[3]LIU J W,GAO F,LUO X L.Survey of deep reinforcement learning based on value function and policy gradient[J].Chinese Journal of Computers,2019,42(6):1406-1438.
[4]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[5]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[6]FUJIMOTO S,HOOF H,MEGER D.Addressing function approximation error in actor-critic methods[C]//Proceedings of the International Conference on Machine Learning.2018:1587-1596.
[7]SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust region policy optimization[C]//Proceedings of the International Conference on Machine Learning.2015:1889-1897.
[8]HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft actor-critic:Off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]//Proceedings of the International Conference on Machine Learning.2018:1861-1870.
[9]KULKARNI T D,NARASIMHAN K,SAEEDI A,et al.Hierarchical deep reinforcement learning:Integrating temporal abstraction and intrinsic motivation[C]//Advances in Neural Information Processing Systems.2016:3675-3683.
[10]ZHAO D,ZHANG L,ZHANG B,et al.MaHRL:Multi-goals abstraction based deep hierarchical reinforcement learning for recommendations[C]//Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval.2020:871-880.
[11]DUAN J,LI S E,GUAN Y,et al.Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data[J].IET Intelligent Transport Systems,2020,14(5):297-305.
[12]LIU J,PAN F,LUO L.GoChat:Goal-oriented chatbots with hierarchical reinforcement learning[C]//Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval.2020:1793-1796.
[13]SUTTON R S,PRECUP D,SINGH S.Between MDPs and semi-MDPs:A framework for temporal abstraction in reinforcement learning[J].Artificial Intelligence,1999,112(1/2):181-211.
[14]LIU C H,ZHU F,LIU Q.Option-Critic Algorithm Based on Sub-Goal Quantity Optimization[J].Chinese Journal of Computers,2021,44(9):1922-1933.
[15]HUANG Z G,LIU Q,ZHANG L H,et al.Research and Development on Deep Hierarchical Reinforcement Learning[J].Journal of Software,2023,34(2):733-760.
[16]BACON P L,HARB J,PRECUP D.The option-critic architecture[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2017.
[17]SUTTON R S,MCALLESTER D A,SINGH S P,et al.Policy gradient methods for reinforcement learning with function approximation[C]//Advances in Neural Information Processing Systems.2000:1057-1063.
[18]EYSENBACH B,GUPTA A,IBARZ J,et al.Diversity is all you need:Learning skills without a reward function[J].arXiv:1802.06070,2018.
[19]BAUMLI K,WARDE-FARLEY D,HANSEN S,et al.Relative variational intrinsic control[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:6732-6740.
[20]ZHANG J,YU H,XU W.Hierarchical reinforcement learning by discovering intrinsic options[C]//Proceedings of the International Conference on Learning Representations.2021.
[21]SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347,2017.
[22]KLISSAROV M,BACON P L,HARB J,et al.Learning options end-to-end for continuous action tasks[J].arXiv:1712.00004,2017.
[23]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540,2016.