Computer Science ›› 2021, Vol. 48 ›› Issue (6): 168-174. DOI: 10.11896/jsjkx.200600133

• Artificial Intelligence •

Meta-reinforcement Learning Algorithm Based on Automating Policy Entropy

LU Jia-you¹, LING Xing-hong¹,², LIU Quan¹, ZHU Fei¹

  1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2. Wenzheng College of Soochow University, Suzhou, Jiangsu 215104, China
  • Received: 2020-06-22  Revised: 2020-07-29  Online: 2021-06-15  Published: 2021-06-03
  • About author: LU Jia-you, born in 1996, postgraduate. His main research interests include imitation learning and meta-reinforcement learning. (15261868763@163.com)
    LING Xing-hong, born in 1968, Ph.D, associate professor. His main research interests include machine learning, artificial intelligence technology and information processing.
  • Supported by: Research on Data Mining and Application of Suzhou Intelligent Public Transportation System Based on Cloud Computing (N311800117) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Traditional deep reinforcement learning methods rely on large numbers of samples and adapt poorly to new tasks. By extracting prior knowledge from previous training tasks, meta-reinforcement learning provides agents with a fast and effective way to adapt to new tasks. Meta-reinforcement learning built on the maximum-entropy reinforcement learning framework optimizes the policy by maximizing both the expected reward and the policy entropy. However, current meta-reinforcement learning algorithms based on this framework generally adopt a fixed temperature parameter, which is unreasonable in the multi-task setting of meta-reinforcement learning. To solve this problem, an algorithm that adaptively adjusts the policy entropy is proposed. First, by constraining the entropy of the policy, the original objective is recast as a constrained optimization problem. The dual variable of this constrained problem is then taken as the temperature parameter, and its update formula is derived by solving for the dual variable with the Lagrange dual method. Following this update formula, the temperature parameter is adjusted adaptively after each round of meta-training. Experimental data show that the average score of the proposed algorithm on Ant-Fwd-Back and Walker-2D increases by 200 and its meta-training efficiency improves by 82%; on Human-Direc-2D the policy converges within 230,000 training steps, a 127% improvement in convergence speed. These results show that the proposed algorithm achieves higher meta-training efficiency and better stability.
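In outline, the entropy constraint E[−log π(a|s)] ≥ H̄ turns the maximum-entropy objective into a constrained problem whose Lagrange dual variable α plays the role of the temperature, updated by gradient steps on J(α) = E[−α(log π(a|s) + H̄)]. The following is a minimal PyTorch sketch of such a dual temperature update, applied once per meta-training round as the abstract describes; it mirrors the standard automatic entropy tuning of soft actor-critic, and names such as target_entropy, log_alpha and update_temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch

# Illustrative setup: action_dim and the entropy bound H_bar are assumptions.
action_dim = 6                                   # e.g., a MuJoCo locomotion task
target_entropy = -float(action_dim)              # common heuristic for the bound H_bar
log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) so alpha stays positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs: torch.Tensor) -> float:
    """One dual step on the temperature after a round of meta-training.

    log_probs contains log pi(a|s) for actions sampled from the current policy.
    Minimizing J(alpha) = E[-alpha * (log pi(a|s) + H_bar)] raises alpha when the
    policy's entropy falls below H_bar and lowers it when entropy exceeds H_bar.
    """
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()    # current temperature

# Placeholder usage; a real agent would pass its policy's log-probabilities.
alpha = update_temperature(-torch.rand(256) * action_dim)
```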

Key words: Maximum entropy, Meta learning, Reinforcement learning

CLC Number: TP181