Computer Science ›› 2025, Vol. 52 ›› Issue (1): 277-288. doi: 10.11896/jsjkx.240100221

• Artificial Intelligence •

Option Discovery Method Based on Symbolic Knowledge

WANG Qidi, SHEN Liwei, WU Tianyi   

  1. School of Computer Science,Fudan University,Shanghai 200438,China
  • Received:2024-01-30 Revised:2024-05-13 Online:2025-01-15 Published:2025-01-09
  • Corresponding author: SHEN Liwei (shenliwei@fudan.edu.cn)
  • About author:WANG Qidi,born in 2000,postgraduate (21210240038@m.fudan.edu.cn).His main research interests include reinforcement learning and program synthesis.
    SHEN Liwei,born in 1982,associate professor.His main research interests include human-cyber-physical fusion system software and robot software engineering.
  • Supported by:
    Shanghai Major Project(2021SHZDZX0103).

Abstract: Hierarchical policy learning based on options is a prominent approach in the field of hierarchical reinforcement learning. Options represent temporal abstractions of specific actions, and a set of options can be combined hierarchically to tackle complex reinforcement learning tasks. For the goal of option discovery, existing research has focused on automatically discovering meaningful options from unstructured demonstration trajectories using supervised or unsupervised methods. However, supervised option discovery requires manual task decomposition and option policy definition, which imposes a substantial additional burden, while options discovered through unsupervised methods often lack rich semantics, limiting their subsequent reuse. Therefore, this paper proposes a symbolic-knowledge-based option discovery method that only requires a symbolic model of the environment. The acquired knowledge can guide option discovery for a variety of tasks in the environment and assign symbolic semantics to the discovered options, enabling their reuse when new tasks are executed. The method decomposes the option discovery process into two stages: trajectory segmentation and behavior cloning. Trajectory segmentation aims to extract semantically meaningful segments from demonstration trajectories; to this end, a segmentation model is trained on the demonstration trajectories, with symbolic knowledge introduced to define a reinforcement learning reward that evaluates segmentation accuracy. Behavior cloning then trains options in a supervised manner on the segmented data, aiming to make the options imitate the demonstrated behaviors. The proposed method is evaluated with option discovery and option reuse experiments in multiple domain environments covering both discrete and continuous spaces. In the trajectory segmentation part of the option discovery experiments, the proposed method achieves segmentation accuracy several percentage points higher than the baseline methods in both discrete and continuous space environments, with the margin rising to 20% on complex environment tasks. Additionally, the option reuse experiments demonstrate that, compared with the baseline method, options enriched with symbolic semantics train faster when reused on new tasks and still converge well on complex tasks that the baseline method fails to accomplish.
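To make the two-stage pipeline concrete, the following Python sketch illustrates the general idea under strong simplifications: symbolic knowledge is reduced to a handful of hand-written state predicates, the segmentation model that the paper trains with a reinforcement learning reward is replaced by a rule that cuts a demonstration whenever a predicate becomes true, and behavior cloning is reduced to a tabular state-to-action lookup. The grid domain, the predicate names (at_door, at_goal), and all functions below are hypothetical illustrations, not the authors' implementation.

```python
"""
Minimal sketch (not the paper's implementation) of the two-stage option
discovery described above: (1) cut a demonstration wherever the symbolic
abstraction of the state changes, (2) behavior-clone one option per achieved
symbol. The grid domain, predicate names and tabular policy are assumptions
made only for illustration.
"""
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]        # toy grid position (x, y)
Action = str                   # e.g. "right"
Step = Tuple[State, Action]    # one (state, action) pair of a demonstration

# Symbolic knowledge: named predicates over states (hypothetical domain model).
PREDICATES: Dict[str, Callable[[State], bool]] = {
    "at_door": lambda s: s == (2, 0),
    "at_goal": lambda s: s == (4, 0),
}

def symbolic_state(s: State) -> frozenset:
    """Names of the predicates that hold in s (the state's symbolic abstraction)."""
    return frozenset(name for name, pred in PREDICATES.items() if pred(s))

def segment(demo: List[Step]) -> List[Tuple[str, List[Step]]]:
    """Stage 1 (simplified): cut the demonstration whenever a new predicate
    becomes true, and label the segment with that predicate. The paper instead
    *learns* the cut points with a reward derived from the symbolic model."""
    segments: List[Tuple[str, List[Step]]] = []
    current: List[Step] = []
    for i, (s, a) in enumerate(demo):
        current.append((s, a))
        nxt = demo[i + 1][0] if i + 1 < len(demo) else s
        gained = symbolic_state(nxt) - symbolic_state(current[0][0])
        if gained:                               # a predicate was achieved -> cut here
            segments.append((sorted(gained)[0], current))
            current = []
    return segments

@dataclass
class Option:
    """An option labeled with the symbol it achieves."""
    symbol: str
    table: Dict[State, Action] = field(default_factory=dict)

    def policy(self, s: State) -> Action:
        return self.table.get(s, "noop")

    def terminated(self, s: State) -> bool:      # terminate once the symbol holds
        return PREDICATES[self.symbol](s)

def behavior_clone(segments: List[Tuple[str, List[Step]]]) -> Dict[str, Option]:
    """Stage 2 (simplified): 'clone' each segment by remembering the demonstrated
    action per visited state; the paper trains a neural policy per option instead."""
    options: Dict[str, Option] = {}
    for symbol, seg in segments:
        opt = options.setdefault(symbol, Option(symbol))
        for s, a in seg:
            opt.table[s] = a
    return options

if __name__ == "__main__":
    # A single demonstration that first reaches the door, then the goal.
    demo = [((0, 0), "right"), ((1, 0), "right"),
            ((2, 0), "right"), ((3, 0), "right"), ((4, 0), "noop")]
    for name, opt in behavior_clone(segment(demo)).items():
        print(name, opt.table)
    # -> at_door {(0, 0): 'right', (1, 0): 'right'}
    #    at_goal {(2, 0): 'right', (3, 0): 'right'}
```

In the paper's setting the cut points come from a learned segmentation model whose reward checks segment endpoints against the symbolic model, and each option's policy is trained by supervised learning on its segments; the sketch only preserves the interface, in which each discovered option carries a symbolic label, a policy, and a termination condition, mirroring the standard option tuple (initiation set, intra-option policy, termination condition).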

Key words: Hierarchical reinforcement learning, Demonstration learning, Option discovery, Markov decision process

CLC Number: TP311