Computer Science ›› 2021, Vol. 48 ›› Issue (9): 257-263.doi: 10.11896/jsjkx.200700044

• Artificial Intelligence •

Meta-inverse Reinforcement Learning Method Based on Relative Entropy

WU Shao-bo1,2,3, FU Qi-ming1,2,3, CHEN Jian-ping2,3, WU Hong-jie1,2, LU You1,2   

  1 School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  2 Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  3 Suzhou Key Laboratory of Mobile Network Technology and Application, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  • Received:2020-07-08 Revised:2021-01-09 Online:2021-09-15 Published:2021-09-10
  • About author: WU Shao-bo, born in 1996, postgraduate. His main research interests include reinforcement learning, inverse reinforcement learning and building energy conservation.
    FU Qi-ming, born in 1985, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include reinforcement learning, deep learning and building energy conservation.
  • Supported by:
    National Natural Science Foundation of China(61876217,61876121,61772357,61750110519,61772355,61702055,61672371) and Primary Research and Development Plan of Jiangsu Province(BE2017663)

Abstract: Aiming at the problem that traditional inverse reinforcement learning algorithms are slow, imprecise, or even unsolvable when solving the reward function, owing to insufficient expert demonstration samples and unknown state transition probabilities, a meta-inverse reinforcement learning method based on relative entropy is proposed. Using meta-learning methods, a learning prior for the target task is constructed by integrating a set of meta-training tasks that follow the same distribution as the target task. In the model-free reinforcement learning setting, the relative entropy probability model is used to model the reward function and is combined with this prior, so that the reward function of the target task can be solved quickly from a small number of target-task samples. The proposed algorithm and the RE IRL algorithm are applied to the classic Gridworld and Object World problems. Experiments show that the proposed algorithm can still solve the reward function well when the target task lacks a sufficient number of expert demonstration samples and state transition probability information.
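The model-free reward-solving step the abstract describes follows relative entropy IRL (Boularias et al., ref [6]): with a linear reward, the weights are fit by gradient ascent, where the gradient is the expert feature expectation minus an importance-weighted feature expectation over trajectories sampled from a baseline policy. The sketch below is a minimal illustration of that gradient step on synthetic trajectory features; the function and variable names, feature dimensions, and learning rate are illustrative assumptions, not the paper's implementation, and the meta-learned prior (which would initialize `theta`) is omitted.

```python
import numpy as np

def re_irl_gradient(theta, expert_feats, sampled_feats):
    """One importance-sampled gradient of the relative entropy IRL objective.

    theta: reward weights (d,); expert_feats: (n_e, d) trajectory feature
    counts from expert demonstrations; sampled_feats: (n_s, d) feature
    counts of trajectories sampled under a baseline policy.
    """
    # Empirical expert feature expectation.
    f_expert = expert_feats.mean(axis=0)
    # Importance weights w_i proportional to exp(theta^T f(tau_i)).
    logits = sampled_feats @ theta
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    w /= w.sum()
    # Gradient: expert mean minus reweighted sample mean.
    return f_expert - w @ sampled_feats

# Toy usage: gradient ascent on theta with synthetic features.
rng = np.random.default_rng(0)
expert = rng.normal(1.0, 0.1, size=(20, 4))    # expert trajectories
samples = rng.normal(0.0, 1.0, size=(200, 4))  # baseline-policy samples
theta = np.zeros(4)                            # prior would start here
for _ in range(100):
    theta += 0.1 * re_irl_gradient(theta, expert, samples)
```

Because the update needs only sampled trajectories, not the transition model, this is what lets the method operate when state transition probabilities are unknown; the meta-learning contribution of the paper replaces the zero initialization of `theta` with a prior aggregated from the meta-training tasks.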

Key words: Gradient descent, Inverse reinforcement learning, Meta-learning, Relative entropy, Reward function

CLC Number: TP311
[1]SUTTON R S,BARTO A G.Reinforcement learning:An introduction[M].MIT Press,2018.
[2]NG A Y,RUSSELL S J.Algorithms for inverse reinforcement learning[C]//Proceedings of the International Conference on Machine Learning.California,USA,2000:663-670.
[3]ABBEEL P,NG A Y.Apprenticeship learning via inverse reinforcement learning[C]//Proceedings of the International Conference on Machine Learning.Banff,Canada,2004:1.
[4]RATLIFF N D,SILVER D,BAGNELL J A.Learning to search:Functional gradient techniques for imitation learning[J].Autonomous Robots,2009,27(1):25-53.
[5]ZIEBART B D,MAAS A L,BAGNELL J A,et al.Maximum Entropy Inverse Reinforcement Learning[C]//Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence(AAAI 2008).Chicago,Illinois,USA,2008:13-17.
[6]BOULARIAS A,KOBER J.Relative Entropy Inverse Rein-forcement Learning[C]//Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011.Fort Lauderdale,FL,USA,2011.
[7]WANG Y X,HEBERT M.Learning to learn:Model regression networks for easy small sample learning[C]//European Conference on Computer Vision.Springer,Cham,2016:616-634.
[8]FINN C,ABBEEL P,LEVINE S.Model-agnostic meta-learning for fast adaptation of deep networks[C]//Proceedings of the 34th International Conference on Machine Learning.2017:1126-1135.
[9]SNELL J,SWERSKY K,ZEMEL R.Prototypical networks for few-shot learning[C]//Advances in Neural Information Processing Systems.2017:4077-4087.
[10]MISHRA N,ROHANINEJAD M,CHEN X,et al.Meta-learning with temporal convolutions[J].arXiv:1707.03141.
[11]ANDRYCHOWICZ M,DENIL M,COLMENAREJO S G,et al.Learning to learn by gradient descent by gradient descent[C]//30th Conference on Neural Information Processing Systems (NIPS 2016).Barcelona,Spain,2016.
[12]CHEN X L,CAO L,HE M,et al.A Summary of Research on Deep Reverse Reinforcement Learning[J].Computer Engineering and Applications,2018,54(5):24-35.
[13]XIA C,KAMEL A E.Neural inverse reinforcement learning in autonomous navigation[J].Robotics & Autonomous Systems,2016,84:1-14.
[14]YI Z,ZHANG H,TAN P,et al.DualGAN:Unsupervised dual learning for image-to-image translation[C]//Proceedings of the IEEE International Conference on Computer Vision.Venice,Italy,2017:2849-2857.
[15]BYRAVAN A,MONFORT M,ZIEBART B,et al.Graph-based inverse optimal control for robot manipulation[C]//Proceedings of the Association for the Advancement of Artificial Intelligence.Austin,USA,2015:1874-1890.