Computer Science, 2021, Vol. 48, Issue 2: 238-244. doi: 10.11896/jsjkx.191100107
JIANG Chong1, ZHANG Zong-zhang2, CHEN Zi-xuan1, ZHU Jia-cheng1, JIANG Jun-peng1
Abstract: Imitation learning provides a framework in which an agent learns how to make decisions from expert demonstrations. During learning, the agent neither interacts with the expert nor relies on the environment's reward signal; it requires only a large number of expert demonstrations. Classical imitation learning methods require first-person expert demonstrations, each consisting of a state sequence and the corresponding sequence of expert actions. In real life, however, expert demonstrations usually exist in the form of third-person videos. Compared with first-person demonstrations, third-person demonstrations are observed from a viewpoint that differs from the agent's, so there is no one-to-one correspondence between the two, and third-person demonstrations cannot be used directly in imitation learning. To address this problem, this paper proposes a data-efficient third-person imitation learning method. First, building on generative adversarial imitation learning, the method introduces an image-difference technique that exploits the Markov property of the Markov decision process and the temporal continuity of its states to remove domain features such as environment background and colour, retaining the part of the observed image most relevant to the behaviour policy for use in imitation learning. Second, the method introduces a variational discriminator bottleneck that constrains the discriminator and further weakens the influence of domain features on policy learning. To evaluate the proposed algorithm, it is tested in three experimental environments on the MuJoCo platform and compared with existing algorithms. The experimental results show that, compared with existing imitation learning methods, the proposed method achieves better performance on third-person imitation learning tasks without requiring additional samples.
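A minimal NumPy sketch of the two ideas the abstract describes: differencing consecutive observation frames so that static domain features (background, colour) cancel while motion survives, and the KL term plus Lagrange-multiplier update that a variational discriminator bottleneck uses to keep the discriminator's latent code close to a prior. Function names, shapes, and the learning rate are illustrative assumptions, not details from the paper.

```python
import numpy as np

def image_difference(frames):
    # Absolute difference of consecutive frames. Because the MDP state
    # changes smoothly in time, static domain features (background,
    # colour) cancel out and only policy-relevant motion remains.
    f = np.asarray(frames, dtype=np.float32)
    return np.abs(f[1:] - f[:-1])

def kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    # This is the quantity a variational discriminator bottleneck
    # constrains below a target capacity I_c.
    return 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma,
                        axis=-1)

def update_beta(beta, kl_value, i_c, lr=1e-5):
    # Dual gradient ascent on the multiplier beta: beta grows while the
    # constraint KL <= I_c is violated, tightening the bottleneck.
    return max(0.0, beta + lr * (kl_value - i_c))
```

In a full implementation the discriminator would be a stochastic encoder trained with the adversarial loss plus `beta * (KL - I_c)`; the sketch above only isolates the preprocessing and the constraint arithmetic.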
CLC Number: