Computer Science, 2025, Vol. 52, Issue (1): 277-288. DOI: 10.11896/jsjkx.240100221

• Artificial Intelligence •

Option Discovery Method Based on Symbolic Knowledge

WANG Qidi, SHEN Liwei, WU Tianyi   

  1. School of Computer Science, Fudan University, Shanghai 200438, China
  • Received: 2024-01-30  Revised: 2024-05-13  Online: 2025-01-15  Published: 2025-01-09
  • About author: WANG Qidi, born in 2000, postgraduate. His main research interests include reinforcement learning and program synthesis.
    SHEN Liwei, born in 1982, associate professor. His main research interests include human-cyber-physical fusion system software and robot software engineering.
  • Supported by:
    Shanghai Major Project (2021SHZDZX0103).

Abstract: Option-based hierarchical policy learning is a prominent approach in hierarchical reinforcement learning. An option is a temporal abstraction over primitive actions, and a set of options can be composed hierarchically to solve complex reinforcement learning tasks. For option discovery, existing research focuses on extracting meaningful options from unstructured demonstration trajectories using supervised or unsupervised methods. However, supervised option discovery requires manual task decomposition and option policy definition, which imposes a considerable additional burden, while options discovered by unsupervised methods often lack rich semantics, which limits their subsequent reuse. This paper therefore proposes an option discovery method based on symbolic knowledge that only requires modeling the symbolic knowledge of the environment. The acquired knowledge can guide option discovery for various tasks in the environment and assign symbolic semantics to the discovered options, enabling their reuse in new tasks. The method decomposes option discovery into two stages: trajectory segmentation and behavior cloning. Trajectory segmentation extracts semantically meaningful segments from demonstration trajectories; to this end, a segmentation model is trained on the demonstrations, with symbolic knowledge defining segmentation accuracy as the reward signal for reinforcement learning. Behavior cloning then trains the options in a supervised manner on the segmented data, so that each option imitates the behavior of its trajectory segments. The proposed method is evaluated in multiple domains covering both discrete and continuous spaces, with experiments on option discovery and option reuse. In the option discovery experiments, the trajectory segmentation results show that the proposed method achieves higher segmentation accuracy than the baseline method, improving by several percentage points in both discrete and continuous environments; on complex environment tasks, segmentation accuracy is further improved by 20%. The option reuse experiments show that options enriched with symbolic semantics adapt to new tasks with faster training than the baseline method, and converge well even on complex tasks that the baseline method fails to accomplish.
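To make the two-stage pipeline described in the abstract concrete, the following minimal Python sketch illustrates it under strong simplifying assumptions; it is not the authors' implementation. All names (SymbolicKnowledge, Segment, segment_trajectory, behavior_clone) are hypothetical, the learned RL segmentation model is replaced by cutting wherever the symbolic label changes, and the supervised option policies are replaced by nearest-neighbour imitation.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

State = Sequence[float]
Action = int

@dataclass
class Segment:
    label: str                                      # symbolic semantics of the segment
    states: List[State] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

class SymbolicKnowledge:
    """Maps a state to a symbolic predicate name; in the paper such knowledge
    scores (rewards) candidate segmentations of the demonstrations."""
    def __init__(self, predicates: Dict[str, Callable[[State], bool]]):
        self.predicates = predicates

    def label(self, state: State) -> str:
        for name, holds in self.predicates.items():
            if holds(state):
                return name
        return "unknown"

def segment_trajectory(states: List[State], actions: List[Action],
                       knowledge: SymbolicKnowledge) -> List[Segment]:
    # Stage 1 (simplified): cut the trajectory wherever the symbolic label changes.
    # The paper instead trains a segmentation model with RL, rewarding cuts that
    # agree with the symbolic knowledge; label changes are a crude stand-in here.
    segments: List[Segment] = []
    current = Segment(label=knowledge.label(states[0]))
    for s, a in zip(states, actions):
        lbl = knowledge.label(s)
        if lbl != current.label and current.states:
            segments.append(current)
            current = Segment(label=lbl)
        current.states.append(s)
        current.actions.append(a)
    if current.states:
        segments.append(current)
    return segments

def behavior_clone(segments: List[Segment]) -> Dict[str, Callable[[State], Action]]:
    # Stage 2 (simplified): fit one option policy per symbolic label by imitating
    # its segments. A nearest-neighbour lookup stands in for the supervised
    # policy network used in the paper.
    by_label: Dict[str, List[Tuple[State, Action]]] = {}
    for seg in segments:
        by_label.setdefault(seg.label, []).extend(zip(seg.states, seg.actions))

    def make_policy(data: List[Tuple[State, Action]]) -> Callable[[State], Action]:
        def policy(state: State) -> Action:
            nearest = min(data, key=lambda sa: sum((x - y) ** 2
                                                   for x, y in zip(sa[0], state)))
            return nearest[1]
        return policy

    return {lbl: make_policy(data) for lbl, data in by_label.items()}

In this sketch the environment designer supplies the predicates dictionary (the symbolic knowledge); each returned policy carries the symbolic label of the behavior it imitates, which is what enables the discovered options to be reused when composing policies for new tasks.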

Key words: Hierarchical reinforcement learning, Demonstration learning, Option discovery, Markov decision process

CLC Number: 

  • TP311