Meta-reinforcement Learning Algorithm Based on Automating Policy Entropy

Computer Science (计算机科学), 2021, Vol. 48, Issue (6): 168-174. doi: 10.11896/jsjkx.200600133

LU Jia-you¹, LING Xing-hong¹,², LIU Quan¹, ZHU Fei¹

1 School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
2 Wenzheng College of Soochow University, Suzhou, Jiangsu 215104, China

Received: 2020-06-22 Revised: 2020-07-29 Online: 2021-06-15 Published: 2021-06-03
Corresponding author: LING Xing-hong (lingxinghong@suda.edu.cn)
About author: LU Jia-you, born in 1996, postgraduate. His main research interests include imitation learning and meta-reinforcement learning. (15261868763@163.com)
LING Xing-hong, born in 1968, Ph.D., associate professor. His main research interests include machine learning, artificial intelligence technology and information processing.
Supported by:
Research on Data Mining and Application of Suzhou Intelligent Public Transportation System Based on Cloud Computing (N311800117) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract

Abstract: Traditional deep reinforcement learning methods rely on a large number of experience samples and adapt poorly to new tasks. By extracting prior knowledge from previous training tasks, meta-reinforcement learning gives agents an effective way to adapt quickly to new tasks. Meta deep reinforcement learning based on the maximum entropy reinforcement learning framework optimizes the policy by maximizing both the expected reward and the policy entropy. However, current meta-reinforcement learning algorithms built on this framework generally use a fixed temperature parameter, which is unreasonable in the multi-task setting of meta-reinforcement learning. To address this problem, the Automating Policy Entropy (APE) algorithm is proposed. The algorithm first constrains the entropy of the policy, which converts the original objective-function optimization problem into a constrained optimization problem. It then takes the dual variable of the constrained problem as the temperature parameter and derives its update formula by the Lagrange dual method. Following this update formula, the temperature parameter is adjusted adaptively after each round of meta-training. Experimental data show that, on Ant-Fwd-Back and Walker-2D, the proposed algorithm increases the average score by 200 and improves meta-training efficiency by 82%; on Humanoid-Direc-2D, the policy converges in 230,000 training steps, and the convergence speed increases by 127%. These results indicate that the proposed algorithm has higher meta-training efficiency and better stability.

Key words: Maximum entropy, Meta learning, Reinforcement learning
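
The abstract describes the temperature parameter as the dual variable of an entropy-constrained problem, updated after each round of meta-training, but it does not state the update formula itself. The sketch below is therefore only an illustration under an assumption: it uses the dual-gradient temperature step commonly paired with Soft Actor-Critic, minimising J(alpha) = E[-alpha * (log pi(a|s) + H_target)]. The function name `update_temperature`, the learning rate, the target entropy and the sample log-probabilities are all hypothetical and are not taken from the paper.

```python
import math


def update_temperature(log_alpha, log_probs, target_entropy, lr=0.01):
    """One dual-gradient step on the temperature alpha = exp(log_alpha).

    Assumed objective (SAC-style, not confirmed by the abstract):
        J(alpha) = mean(-alpha * (log_pi + target_entropy))
    Optimising log_alpha instead of alpha keeps the temperature positive.
    """
    alpha = math.exp(log_alpha)
    # Monte Carlo estimate of the policy entropy from sampled log-probabilities.
    entropy_estimate = -sum(log_probs) / len(log_probs)
    # dJ/d(log_alpha) = alpha * (entropy_estimate - target_entropy)
    grad = alpha * (entropy_estimate - target_entropy)
    # Gradient descent: alpha grows when the entropy is below the target.
    return log_alpha - lr * grad


if __name__ == "__main__":
    log_alpha = 0.0  # hypothetical initial temperature alpha = 1.0
    # Hypothetical log-probabilities of sampled actions, one list per round.
    rounds = [[-0.2, -0.4], [-0.9, -1.1], [-1.6, -1.8]]
    for log_probs in rounds:
        # The temperature is adjusted once after each meta-training round.
        log_alpha = update_temperature(log_alpha, log_probs, target_entropy=1.0)
        print(round(math.exp(log_alpha), 4))
```

With this rule the temperature rises when the estimated policy entropy falls below the target and falls when the entropy exceeds it, which is the behaviour a fixed temperature cannot provide across tasks of differing difficulty.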

CLC number: TP181

LU Jia-you, LING Xing-hong, LIU Quan, ZHU Fei. Meta-reinforcement Learning Algorithm Based on Automating Policy Entropy[J]. Computer Science, 2021, 48(6): 168-174. https://doi.org/10.11896/jsjkx.200600133

