Computer Science ›› 2021, Vol. 48 ›› Issue (6): 168-174. doi: 10.11896/jsjkx.200600133
陆嘉猷1, 凌兴宏1,2, 刘全1, 朱斐1
LU Jia-you1, LING Xing-hong1,2, LIU Quan1, ZHU Fei1
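Abstract: Traditional deep reinforcement learning methods rely on large numbers of experience samples and adapt poorly to new tasks. Meta-reinforcement learning extracts prior knowledge from previous training tasks, giving an agent an effective way to adapt quickly to new tasks. Meta deep reinforcement learning built on the maximum entropy reinforcement learning framework optimizes the policy by maximizing both the expected reward and the policy entropy. However, current meta-reinforcement learning algorithms based on the maximum entropy framework generally use a fixed temperature parameter, which is unreasonable in the multi-task setting of meta-reinforcement learning. To address this problem, the Automating Policy Entropy (APE) algorithm is proposed. The algorithm first constrains the entropy of the policy, turning the original objective optimization problem into a constrained optimization problem. It then treats the dual variable of the constrained problem as the temperature parameter and derives its update rule by Lagrangian duality. Using this update rule, the temperature parameter is adjusted adaptively at the end of each round of meta-training. Experimental data show that the proposed algorithm raises the average score on Ant-Fwd-Back and Walker-2D by 200 and improves meta-training efficiency by 82%; on Humanoid-Direc-2D, the policy converges in 230,000 training steps, a 127% improvement in convergence speed. The results indicate that the proposed algorithm achieves higher meta-training efficiency and better stability.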
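The abstract does not give the update rule itself, so the following is only a minimal sketch of how a temperature parameter can be tuned as a dual variable under an entropy constraint. It assumes the commonly used dual objective J(α) = E[-α (log π(a|s) + H_target)], as in Soft Actor-Critic with automatic temperature tuning, and may differ from the APE formula in the paper. The function name `update_log_alpha`, the learning rate, and the sample values are hypothetical.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.1):
    """One gradient step on an assumed dual objective for the temperature.

    Assumed objective (not taken from the paper):
        J(alpha) = E[-alpha * (log_pi(a|s) + target_entropy)]
    The temperature is stored as log(alpha) so that alpha stays positive.
    """
    alpha = math.exp(log_alpha)
    # Monte Carlo estimate of E[log_pi(a|s)]; its negative estimates the entropy.
    mean_log_prob = sum(log_probs) / len(log_probs)
    # dJ/d(log_alpha) = -alpha * (E[log_pi] + target_entropy)
    grad = -alpha * (mean_log_prob + target_entropy)
    return log_alpha - lr * grad


if __name__ == "__main__":
    log_alpha = 0.0  # alpha = 1.0
    # Hypothetical log-probabilities collected during one meta-training round.
    # Estimated entropy is 0.5, below the target of 1.0, so alpha should grow.
    log_alpha = update_log_alpha(log_alpha, [-0.4, -0.6], target_entropy=1.0)
    print(math.exp(log_alpha))
```

In this sketch the temperature rises when the estimated policy entropy falls below the target and falls when the entropy exceeds it, which is the qualitative behavior a constrained-entropy formulation calls for. Calling the function once per meta-training round would mirror the schedule described in the abstract.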