Computer Science (计算机科学) ›› 2021, Vol. 48 ›› Issue (9): 257-263. doi: 10.11896/jsjkx.200700044
吴少波1,2,3, 傅启明1,2,3, 陈建平2,3, 吴宏杰1,2, 陆悠1,2
WU Shao-bo1,2,3, FU Qi-ming1,2,3, CHEN Jian-ping2,3, WU Hong-jie1,2, LU You1,2
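Abstract: Traditional inverse reinforcement learning algorithms solve for the reward function slowly, imprecisely, or not at all when expert demonstrations are scarce and the state transition probabilities are unknown. To address this, a meta inverse reinforcement learning method based on relative entropy is proposed. Using meta-learning, a learning prior for the target task is built from a set of meta-training tasks drawn from the same distribution as the target task. In the model-free reinforcement learning setting, the reward function is modeled with a relative entropy probability model and combined with the constructed prior, so that the target task's reward function can be solved quickly from a small number of target-task samples. The proposed algorithm and the REIRL algorithm are applied to the classic Gridworld and Object World problems. Experiments show that the proposed algorithm still recovers the reward function well when the target task lacks both a sufficient number of expert demonstrations and state transition probability information.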
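The relative entropy step named in the abstract can be illustrated with a minimal sketch. It assumes a linear reward over trajectory features and the commonly used importance-sampled form of the relative entropy IRL gradient, in which trajectories sampled from a known policy stand in for the unknown transition model. The function names (`importance_weights`, `reirl_gradient`, `fit_reward`) and the choice to use the meta-learned prior simply as the starting weights `theta_prior` are assumptions made for illustration; the abstract does not say how the prior enters the optimization.

```python
import math

def importance_weights(theta, sampled_features, log_sampling_probs):
    """Self-normalised weights proportional to exp(theta . f) / q(trajectory)."""
    logits = [
        sum(t * f for t, f in zip(theta, feats)) - log_q
        for feats, log_q in zip(sampled_features, log_sampling_probs)
    ]
    top = max(logits)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(v - top) for v in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def reirl_gradient(theta, expert_mean, sampled_features, log_sampling_probs, eps=0.0):
    """Expert feature mean minus the reweighted sample mean, minus an L1 term."""
    weights = importance_weights(theta, sampled_features, log_sampling_probs)
    grad = []
    for i, th in enumerate(theta):
        model_mean = sum(w * feats[i] for w, feats in zip(weights, sampled_features))
        sign = (th > 0) - (th < 0)
        grad.append(expert_mean[i] - model_mean - eps * sign)
    return grad

def fit_reward(theta_prior, expert_mean, sampled_features, log_sampling_probs,
               lr=0.5, steps=200, eps=0.0):
    """Gradient ascent on the reward weights, started from a prior (assumption)."""
    theta = list(theta_prior)
    for _ in range(steps):
        grad = reirl_gradient(theta, expert_mean, sampled_features,
                              log_sampling_probs, eps)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Toy data: two sampled trajectories with one-hot features, drawn uniformly.
sampled = [[1.0, 0.0], [0.0, 1.0]]
log_q = [math.log(0.5), math.log(0.5)]
expert = [0.8, 0.2]  # hypothetical expert feature expectation
theta = fit_reward([0.0, 0.0], expert, sampled, log_q)
```

In this toy case the fitted weights favor the first feature, matching the expert's preference. A prior closer to the solution than the zero vector would need fewer ascent steps, which is the intuition behind reusing knowledge from meta-training tasks when target-task samples are few.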