Computer Science ›› 2026, Vol. 53 ›› Issue (3): 129-135. doi: 10.11896/jsjkx.250600131

• Database & Big Data & Data Science •

Twice Learning Revitalizes Behavior Cloning

FAN Wenshu, WAN Shenghua, LI Xinchun, SUN Haihang, HUANG Kaichen, GAN Le, ZHAN Dechuan

  1. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2025-06-19  Revised: 2025-12-09  Online: 2026-03-12
  • Corresponding author: ZHAN Dechuan (dczhan@lamda.nju.edu.cn)
  • About author: FAN Wenshu (fanws@lamda.nju.edu.cn), born in 1997, Ph.D, is a member of CCF (No.G4364G). His main research interests include machine learning and data mining.
    ZHAN Dechuan, born in 1982, Ph.D, professor. His main research interests include machine learning and data mining.
  • Supported by:
    National Science and Technology Major Project (2022ZD0114805).

Abstract: In the imitation learning method of behavior cloning (BC), an agent tends to take random actions when encountering states that are not covered by the expert data. This deviation from the expert policy leads to what is known as compounding error, a critical factor affecting the performance of BC. To address this issue, this paper first establishes that BC can be regarded as a simplified form of twice learning. It then points out that in discrete action environments, BC focuses only on aligning with the expert-selected actions while ignoring the probability information associated with the other actions, resulting in incomplete extraction of expert knowledge. Inspired by twice learning, this paper proposes an enhanced version of BC, termed complete behavior cloning (CBC), which leverages a more comprehensive set of information from the expert data. To validate the effectiveness of this approach, multiple comparative experiments are designed. The results demonstrate that CBC not only mitigates compounding error but also exhibits high transferability across devices, enhanced robustness to noise, and reduced dependency on expert data. These findings suggest that BC can become highly practical with only minor modifications while remaining simple to run. Moreover, the experimental results further confirm the guiding role and effectiveness of twice learning in reinforcement learning problems.

Key words: Imitation learning, Behavior cloning, Compounding error, Twice learning, Information extraction
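The contrast the abstract draws — standard BC supervising only the expert's chosen action versus CBC-style supervision over the expert's full action distribution — can be illustrated with a minimal sketch. The function names, the softmax expert policy, and the KL-divergence form of the distribution-matching loss below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def bc_loss(agent_logits, expert_probs):
    """Standard BC: cross-entropy against the expert's single chosen action.
    Probability mass the expert puts on other actions is discarded."""
    chosen = int(np.argmax(expert_probs))  # action the expert actually took
    agent_probs = softmax(agent_logits)
    return float(-np.log(agent_probs[chosen]))

def cbc_loss(agent_logits, expert_probs):
    """CBC-style target (assumed form): KL divergence to the expert's full
    action distribution, so every action's probability acts as supervision."""
    agent_probs = softmax(agent_logits)
    return float(np.sum(expert_probs * (np.log(expert_probs) - np.log(agent_probs))))

# Hypothetical expert policy over 3 discrete actions, and an agent's current logits.
expert = softmax(np.array([2.0, 1.0, 0.1]))
agent = np.array([1.5, 1.4, 0.2])

print(bc_loss(agent, expert))   # only the argmax action contributes
print(cbc_loss(agent, expert))  # all three action probabilities contribute
```

Note that when the agent's distribution exactly matches the expert's, the KL term vanishes, whereas the plain BC loss stays positive whenever the expert policy itself is not deterministic — the extra signal CBC extracts comes precisely from those non-argmax probabilities.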

CLC Number: TP391