Computer Science ›› 2025, Vol. 52 ›› Issue (1): 277-288. doi: 10.11896/jsjkx.240100221
王麒迪, 沈立炜, 吴天一
WANG Qidi, SHEN Liwei, WU Tianyi
Abstract: Option-based hierarchical policy learning is a principal approach in hierarchical reinforcement learning. An option is a temporal abstraction over specific actions, and a set of options, composed across multiple levels, can solve complex reinforcement learning tasks. For the goal of option discovery, existing work automatically discovers meaningful options from unstructured demonstration trajectories in supervised or unsupervised ways. However, supervised option discovery requires humans to decompose the task and define option policies, which imposes a substantial extra burden, while options discovered in an unsupervised way rarely carry rich semantics, which limits their later reuse. To address this, a symbolic-knowledge-based option discovery method is proposed: only a symbolic model of the environment is required, and the resulting knowledge can guide option discovery for multiple tasks in that environment and endow the discovered options with symbolic semantics, so that they can be reused when new tasks are executed. Option discovery is decomposed into two stages, trajectory segmentation and behavior cloning. Trajectory segmentation aims to extract semantically meaningful segments from demonstration trajectories; to this end, a segmentation model over demonstration trajectories is trained, with symbolic knowledge used to define a reinforcement learning reward that evaluates segmentation accuracy. Behavior cloning then trains options in a supervised manner on the segmented data, so that the options imitate the trajectory behavior. Option discovery and option reuse experiments are conducted with the proposed method in several domains covering both discrete and continuous spaces. In the trajectory segmentation part of option discovery, the proposed method's segmentation accuracy exceeds the baselines by several percentage points in both discrete and continuous environments, and by up to 20% on segmentation for complex environment tasks. In addition, the option reuse experiments show that, compared with the baselines, options enhanced with symbolic semantics train faster when reused on new tasks and still converge well on complex tasks that the baselines cannot complete.
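The following is a minimal, self-contained Python sketch (not the paper's implementation) of the two-stage pipeline the abstract describes: candidate cuts of a demonstration trajectory are scored by a reward derived from a symbolic model of the environment, and one option is then behavior-cloned per segment and tagged with the symbolic predicates it reaches. The toy grid world, the predicates has_key/at_door, and the random-search segmenter are illustrative assumptions standing in for the paper's environments and its RL-trained segmentation model.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of
# symbolic-knowledge-guided option discovery in two stages:
#   1) trajectory segmentation scored by a symbolic reward,
#   2) behavior cloning of one option per segment.
from collections import defaultdict
import random

# Toy demonstration: (state, action) pairs in a hypothetical grid world.
# state = (x, y, has_key); all names here are assumptions for illustration.
demo = [
    ((0, 0, False), "right"), ((1, 0, False), "right"),
    ((2, 0, False), "pick_key"), ((2, 0, True), "up"),
    ((2, 1, True), "up"), ((2, 2, True), "open_door"),
]

def predicates(state):
    """Symbolic abstraction of a raw state (the assumed symbolic model)."""
    x, y, has_key = state
    facts = [("has_key", has_key), ("at_door", (x, y) == (2, 2))]
    return frozenset(name for name, holds in facts if holds)

def segment_reward(cuts, trajectory):
    """Symbolic reward for a set of cut indices: +1 where the symbolic
    abstraction changes across the cut, -1 for a spurious cut."""
    score = 0.0
    for i in cuts:
        changed = predicates(trajectory[i - 1][0]) != predicates(trajectory[i][0])
        score += 1.0 if changed else -1.0
    return score

def search_cuts(trajectory, n_samples=200, seed=0):
    """Stand-in for the RL-trained segmentation model: sample candidate
    cut sets and keep the one with the highest symbolic reward."""
    rng = random.Random(seed)
    best, best_score = frozenset(), float("-inf")
    for _ in range(n_samples):
        cuts = frozenset(i for i in range(1, len(trajectory)) if rng.random() < 0.5)
        score = segment_reward(cuts, trajectory)
        if score > best_score:
            best, best_score = cuts, score
    return sorted(best)

def behaviour_clone(trajectory, cuts):
    """Fit one tabular option per segment (majority action per state)
    and tag it with the symbolic predicates reached at the segment end."""
    bounds = [0] + list(cuts) + [len(trajectory)]
    options = []
    for lo, hi in zip(bounds, bounds[1:]):
        counts = defaultdict(lambda: defaultdict(int))
        for state, action in trajectory[lo:hi]:
            counts[state][action] += 1
        policy = {s: max(acts, key=acts.get) for s, acts in counts.items()}
        end_state = trajectory[hi][0] if hi < len(trajectory) else trajectory[-1][0]
        options.append((predicates(end_state), policy))
    return options

cuts = search_cuts(demo)
print("cuts at indices:", cuts)
for goal, policy in behaviour_clone(demo, cuts):
    print("option reaching", sorted(goal), "->", policy)
```

On this toy demonstration, the highest-reward cuts fall exactly where the symbolic abstraction changes (after pick_key and on reaching the door), which is the property the symbolic reward is intended to encourage; each cloned option then carries the predicates it achieves as its reusable symbolic label.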
CLC Number: