Computer Science, 2021, Vol. 48, Issue 2: 238-244. doi: 10.11896/jsjkx.191100107
JIANG Chong1, ZHANG Zong-zhang2, CHEN Zi-xuan1, ZHU Jia-cheng1, JIANG Jun-peng1
Abstract: Imitation learning provides a framework in which an agent learns how to make decisions from expert demonstrations. During learning, the agent neither interacts with the expert nor relies on the environment's reward signal; it requires only a large number of expert demonstrations. Classical imitation learning methods require first-person expert demonstrations, each consisting of a state sequence and the corresponding sequence of expert actions. In real life, however, expert demonstrations usually exist in the form of third-person videos. Compared with first-person demonstrations, third-person demonstrations are observed from a viewpoint that differs from the agent's, so there is no one-to-one correspondence between the two, and third-person demonstrations cannot be used directly in imitation learning. To address this problem, this paper proposes a data-efficient third-person imitation learning method. First, building on generative adversarial imitation learning, the method introduces an image-difference technique that exploits the Markov property of the Markov decision process and the temporal continuity of its states to remove domain features such as environment background and colour, retaining the part of the observed image most relevant to the behaviour policy for use in imitation learning. Second, the method introduces a variational discriminator bottleneck that constrains the discriminator and further weakens the influence of domain features on policy learning. To evaluate the proposed algorithm, it is tested in three experimental environments on the MuJoCo platform and compared with existing algorithms. The experimental results show that, compared with existing imitation learning methods, the proposed method achieves better performance on third-person imitation learning tasks without requiring additional samples.
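A minimal NumPy sketch of the two ideas the abstract describes: differencing consecutive observation frames so that static domain features (background, colour) cancel while motion survives, and the KL term plus Lagrange-multiplier update that a variational discriminator bottleneck uses to keep the discriminator's latent code close to a prior. Function names, shapes, and the learning rate are illustrative assumptions, not details from the paper.

```python
import numpy as np

def image_difference(frames):
    # Absolute difference of consecutive frames. Because the MDP state
    # changes smoothly in time, static domain features (background,
    # colour) cancel out and only policy-relevant motion remains.
    f = np.asarray(frames, dtype=np.float32)
    return np.abs(f[1:] - f[:-1])

def kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    # This is the quantity a variational discriminator bottleneck
    # constrains below a target capacity I_c.
    return 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma,
                        axis=-1)

def update_beta(beta, kl_value, i_c, lr=1e-5):
    # Dual gradient ascent on the multiplier beta: beta grows while the
    # constraint KL <= I_c is violated, tightening the bottleneck.
    return max(0.0, beta + lr * (kl_value - i_c))
```

In a full implementation the discriminator would be a stochastic encoder trained with the adversarial loss plus `beta * (KL - I_c)`; the sketch above only isolates the preprocessing and the constraint arithmetic.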
CLC Number: