Computer Science ›› 2026, Vol. 53 ›› Issue (3): 129-135. doi: 10.11896/jsjkx.250600131

• Database & Big Data & Data Science •

Twice Learning Revitalizes Behavior Cloning

FAN Wenshu, WAN Shenghua, LI Xinchun, SUN Haihang, HUANG Kaichen, GAN Le, ZHAN Dechuan

  1. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2025-06-19  Revised: 2025-12-09  Online: 2026-03-12
  • Corresponding author: ZHAN Dechuan (dczhan@lamda.nju.edu.cn)
  • About author: FAN Wenshu (fanws@lamda.nju.edu.cn), born in 1997, Ph.D, is a member of CCF (No.G4364G). His main research interests include machine learning and data mining.
    ZHAN Dechuan, born in 1982, Ph.D, professor. His main research interests include machine learning and data mining.
  • Supported by:
    National Science and Technology Major Project (2022ZD0114805).

Abstract: In the imitation learning method of behavior cloning (BC), an agent tends to take random actions when encountering states that are not covered by the expert data. This deviation from the expert policy leads to what is known as compounding error, a critical factor affecting the performance of BC. To address this issue, this paper first establishes that BC can be regarded as a simplified form of twice learning. It then points out that in discrete action environments, BC focuses only on aligning with the expert-selected actions while ignoring the probability information associated with the other actions, resulting in incomplete extraction of expert knowledge. Inspired by twice learning, this paper proposes an enhanced version of BC, termed complete behavior cloning (CBC), which leverages a more comprehensive set of information from the expert data. To validate the effectiveness of this approach, multiple comparative experiments are designed. The results demonstrate that CBC not only mitigates compounding error but also exhibits high transferability across devices, enhanced robustness to noise, and reduced dependency on expert data. These findings suggest that BC can become highly practical with only minor modifications while remaining simple to run. Moreover, the experimental results further confirm the guiding role and effectiveness of twice learning in reinforcement learning problems.

Key words: Imitation learning, Behavior cloning, Compounding error, Twice learning, Information extraction
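The contrast the abstract draws — standard BC supervising only the expert's chosen action versus CBC-style supervision over the expert's full action distribution — can be illustrated with a minimal sketch. The function names, the softmax expert policy, and the KL-divergence form of the distribution-matching loss below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def bc_loss(agent_logits, expert_probs):
    """Standard BC: cross-entropy against the expert's single chosen action.
    Probability mass the expert puts on other actions is discarded."""
    chosen = int(np.argmax(expert_probs))  # action the expert actually took
    agent_probs = softmax(agent_logits)
    return float(-np.log(agent_probs[chosen]))

def cbc_loss(agent_logits, expert_probs):
    """CBC-style target (assumed form): KL divergence to the expert's full
    action distribution, so every action's probability acts as supervision."""
    agent_probs = softmax(agent_logits)
    return float(np.sum(expert_probs * (np.log(expert_probs) - np.log(agent_probs))))

# Hypothetical expert policy over 3 discrete actions, and an agent's current logits.
expert = softmax(np.array([2.0, 1.0, 0.1]))
agent = np.array([1.5, 1.4, 0.2])

print(bc_loss(agent, expert))   # only the argmax action contributes
print(cbc_loss(agent, expert))  # all three action probabilities contribute
```

Note that when the agent's distribution exactly matches the expert's, the KL term vanishes, whereas the plain BC loss stays positive whenever the expert policy itself is not deterministic — the extra signal CBC extracts comes precisely from those non-argmax probabilities.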

CLC Number: TP391