Computer Science ›› 2026, Vol. 53 ›› Issue (3): 129-135. doi: 10.11896/jsjkx.250600131
FAN Wenshu, WAN Shenghua, LI Xinchun, SUN Haihang, HUANG Kaichen, GAN Le, ZHAN Dechuan
Abstract: In behavior cloning, an imitation learning method, an agent that encounters states absent from the expert data acts arbitrarily and drifts away from the expert policy. This phenomenon, known as compounding error, is a major factor limiting the performance of behavior cloning. To address it, this paper first shows that behavior cloning is essentially a simplified form of twice learning, and then points out that in discrete-action environments behavior cloning only aligns with the single action the expert policy takes, ignoring the probability information of the other actions, so its extraction of expert information is incomplete. By analogy with twice learning, a complete behavior cloning method is proposed that extracts more of the information contained in the expert data. Multiple comparative experiments show that complete behavior cloning not only mitigates the compounding error of behavior cloning but also offers high device transferability, strong noise resistance, and low dependence on expert data. The experimental results indicate that behavior cloning becomes highly practical with only minor modifications while remaining simple to run, and they further validate the guiding role and effectiveness of twice learning in reinforcement learning problems.
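The distinction the abstract draws can be sketched as follows: standard behavior cloning trains against only the expert's chosen action (a one-hot target), while the "complete" variant matches the expert's full action distribution, so probability mass on non-chosen actions also serves as a training signal (analogous to knowledge distillation). This NumPy sketch is illustrative only; the function names and loss formulation are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_loss(policy_logits, expert_actions):
    # Standard behavior cloning: cross-entropy against the single
    # action the expert took; other actions' probabilities are ignored.
    probs = softmax(policy_logits)
    n = len(expert_actions)
    return -np.mean(np.log(probs[np.arange(n), expert_actions]))

def complete_bc_loss(policy_logits, expert_probs):
    # "Complete" variant (assumed form): KL divergence against the
    # expert's full action distribution, so the relative probabilities
    # of all actions carry information, as in knowledge distillation.
    probs = softmax(policy_logits)
    kl = np.sum(expert_probs * (np.log(expert_probs) - np.log(probs)), axis=-1)
    return np.mean(kl)
```

When the policy already matches the expert distribution exactly, the KL term is zero, whereas the one-hot cross-entropy keeps pushing probability toward the chosen action alone.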