Computer Science ›› 2021, Vol. 48 ›› Issue (9): 257-263. doi: 10.11896/jsjkx.200700044

• Artificial Intelligence •


Meta-inverse Reinforcement Learning Method Based on Relative Entropy

WU Shao-bo1,2,3, FU Qi-ming1,2,3, CHEN Jian-ping2,3, WU Hong-jie1,2, LU You1,2   

  1. School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
    2. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
    3. Suzhou Key Laboratory of Mobile Network Technology and Application, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  • Received: 2020-07-08 Revised: 2021-01-09 Online: 2021-09-15 Published: 2021-09-10
  • Corresponding author: FU Qi-ming (fqm_1@126.com)
  • About author: WU Shao-bo, born in 1996, postgraduate (wushaobo_1@163.com). His main research interests include reinforcement learning, inverse reinforcement learning and building energy conservation.
    FU Qi-ming, born in 1985, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include reinforcement learning, deep learning and building energy conservation.
  • Supported by:
    National Natural Science Foundation of China (61876217, 61876121, 61772357, 61750110519, 61772355, 61702055, 61672371) and Key Research and Development Program of Jiangsu Province (BE2017663)


Abstract: To address the problem that traditional inverse reinforcement learning algorithms solve the reward function slowly and imprecisely, or cannot solve it at all, when expert demonstration samples are insufficient and the state transition probabilities are unknown, a meta-inverse reinforcement learning method based on relative entropy is proposed. Using a meta-learning approach, a learning prior for the target task is constructed from a set of meta-training tasks drawn from the same distribution as the target task. For the model-free reinforcement learning problem, the reward function is modeled with a relative entropy probability model and combined with the constructed prior, so that the reward function of the target task can be solved quickly from only a small number of target-task samples. The proposed algorithm and the REIRL algorithm are applied to the classic Gridworld and Object World problems. Experiments show that the proposed algorithm can still solve the reward function well when the target task lacks a sufficient number of expert demonstration samples and state transition probability information.

Key words: Gradient descent, Inverse reinforcement learning, Meta-learning, Relative entropy, Reward function

  • CLC Number: TP311
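
The abstract describes the adaptation step only at a high level, so the following is a minimal, hypothetical Python sketch of one way such a step could look: a linear reward r(τ) = θ·φ(τ), a relative-entropy (REIRL-style) importance-weighted gradient over trajectories sampled from a base policy, and a meta-learned initialization crudely approximated here by averaging reward weights recovered on meta-training tasks. All names (reirl_gradient, adapt_reward, meta_prior) and the averaging shortcut are illustrative assumptions, not the authors' implementation.

import numpy as np

def reirl_gradient(theta, expert_feats, sampled_feats):
    # REIRL-style gradient: expert feature expectation minus the
    # importance-weighted feature expectation of sampled trajectories
    # (model-free: no transition probabilities needed).
    # expert_feats: (n_expert, d) trajectory feature counts from demonstrations
    # sampled_feats: (n_sample, d) trajectory feature counts from a base policy
    expert_mean = expert_feats.mean(axis=0)
    logits = sampled_feats @ theta      # linear reward of each sampled trajectory
    logits -= logits.max()              # numerical stability for the softmax
    w = np.exp(logits)
    w /= w.sum()                        # importance weights over sampled trajectories
    return expert_mean - w @ sampled_feats

def adapt_reward(theta_prior, expert_feats, sampled_feats, lr=0.1, steps=20):
    # Few-shot adaptation: start from the meta-learned prior and take a few
    # gradient-ascent steps using only the target task's small demonstration set.
    theta = theta_prior.copy()
    for _ in range(steps):
        theta = theta + lr * reirl_gradient(theta, expert_feats, sampled_feats)
    return theta

def meta_prior(per_task_thetas):
    # Crude stand-in for the meta-training stage: aggregate reward weights
    # recovered on tasks drawn from the same distribution as the target task.
    return np.mean(per_task_thetas, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8                                                          # feature dimension
    theta0 = meta_prior([rng.normal(size=d) for _ in range(5)])    # 5 meta-training tasks
    theta = adapt_reward(theta0,
                         expert_feats=rng.normal(size=(3, d)),     # few expert trajectories
                         sampled_feats=rng.normal(size=(50, d)))   # random-policy samples
    print(theta)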