Computer Science ›› 2021, Vol. 48 ›› Issue (2): 238-244. doi: 10.11896/jsjkx.191100107

• Artificial Intelligence •

Data Efficient Third-person Imitation Learning Method

JIANG Chong1, ZHANG Zong-zhang2, CHEN Zi-xuan1, ZHU Jia-cheng1, JIANG Jun-peng1   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  2 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2019-11-14 Revised: 2020-04-16 Online: 2021-02-15 Published: 2021-02-04
  • About author: JIANG Chong, born in 1995, postgraduate, is a member of China Computer Federation. His main research interests include imitation learning and reinforcement learning.
    ZHANG Zong-zhang, born in 1985, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include reinforcement learning, intelligent planning and multi-agent systems.
  • Supported by:
    The National Natural Science Foundation of China (61876119), Natural Science Foundation of Jiangsu (BK20181432) and Fundamental Research Funds for the Central Universities (14380005).

Abstract: Imitation learning provides a framework for an agent to learn an efficient policy from expert demonstrations. During learning, the agent does not need to interact with the expert or access an explicit reward signal; it needs only a large number of expert demonstrations. Classical imitation learning methods usually imitate from first-person expert demonstrations, that is, sequences of the states and actions the expert took. In practice, however, most expert demonstrations exist as third-person videos. Unlike first-person demonstrations, third-person demonstrations are recorded from a viewpoint that differs from that of the samples generated by the agent, so there is no one-to-one correspondence between the two, and third-person demonstrations cannot be used directly for imitation learning. To alleviate this problem, this paper presents a data-efficient third-person imitation learning method. First, building on Generative Adversarial Imitation Learning (GAIL), the method introduces image differences, which exploit the Markov property of the Markov decision process and the temporal continuity of states to eliminate domain features such as the environment background and colors, so that the parts most relevant to the policy are retained for imitation learning. Second, it introduces a variational discriminator bottleneck that constrains the discriminator, reducing the influence of domain features on policy learning. To verify the performance of the proposed algorithm, experiments are conducted on three MuJoCo tasks and the results are compared with those of existing algorithms. The experimental results indicate that, when imitating from third-person expert demonstrations, the proposed method achieves significant performance improvements over existing methods and does not require additional demonstrations.
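The image-difference idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes one plausible reading: subtracting consecutive frames so that static, domain-specific content such as a constant background cancels out. The function names `image_difference` and `make_toy_video`, the additive toy scene, and all sizes and intensities are hypothetical and not taken from the paper.

```python
import numpy as np


def image_difference(frames: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences for a (T, H, W, C) uint8 video.

    Assumption: "image difference" is read here as o[t+1] - o[t].
    Frames are cast to int16 first so that the subtraction cannot
    wrap around the way uint8 arithmetic would.
    """
    frames = frames.astype(np.int16)
    return frames[1:] - frames[:-1]


def make_toy_video(background: int, num_frames: int = 4, size: int = 8) -> np.ndarray:
    """Hypothetical toy clip: a 2x2 patch moves right over a constant background.

    The patch adds a fixed intensity of 100 on top of the background
    (an additive toy model chosen only to keep the example exact).
    """
    video = np.full((num_frames, size, size, 1), background, dtype=np.int16)
    for t in range(num_frames):
        video[t, 3:5, t:t + 2, 0] += 100
    return video.astype(np.uint8)


# Two "domains" that differ only in background brightness.
expert_video = make_toy_video(background=50)
agent_video = make_toy_video(background=100)

expert_diff = image_difference(expert_video)
agent_diff = image_difference(agent_video)

# The raw frames differ, but the static background cancels in the differences.
print(np.array_equal(expert_video, agent_video))
print(np.array_equal(expert_diff, agent_diff))
```

In this toy setting the raw videos differ while their differences coincide, because only the motion survives the subtraction. In real images the moving regions would still carry some domain information, so the sketch illustrates only the cancellation of static content.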

Key words: Data efficient, Domain feature, Image difference, Imitation learning, Third-person, Variational discriminator bottleneck

CLC Number: 

  • TP181