Computer Science, 2021, Vol. 48, Issue (10): 30-36. doi: 10.11896/jsjkx.201000129

• Artificial Intelligence •


Monte Carlo Tree Search for High-dimensional Continuous Control Space

LIU Tian-xing, LI Wei, XU Zheng, ZHANG Li-hua, QI Xiao-ya, GAN Zhong-xue   

  1. Institute of AI and Robotics, Fudan University, Shanghai 200433, China
  2. Jihua Laboratory, Foshan, Guangdong 528000, China
  • Received: 2020-10-22  Revised: 2021-03-12  Online: 2021-10-15  Published: 2021-10-18
  • Corresponding author: GAN Zhong-xue (ganzhongxue@fudan.edu.cn)
  • About author: LIU Tian-xing (mliutianxing@126.com), born in 1996, postgraduate. His main research interests include reinforcement learning, machine learning and collective intelligence evolution.
    GAN Zhong-xue, born in 1952, Ph.D., distinguished professor. His main research interests include the basic theory and key technologies of collective intelligence, autonomous intelligent robots, and flexible automatic control.
  • Supported by:
    Jihua Laboratory Fund (X190021TB190) and Shanghai Committee of Science and Technology, China (19511132000).

Abstract: Monte Carlo tree search (MCTS) has achieved great success in low-dimensional discrete control tasks. However, many real-world tasks require planning actions in a continuous action space. Because the set of candidate actions in a continuous space is too large, MCTS can hardly identify the best action within a limited time budget. As a variant of MCTS, the Kernel Regression UCT (KR-UCT) algorithm improves the simulation efficiency of MCTS in low-dimensional continuous action spaces by using a predefined kernel function to generalize local information across similar actions. However, while interacting with the environment, KR-UCT has to run a large number of simulations from scratch at every step in order to find the best action, so it remains limited to low-dimensional continuous action spaces and struggles to select the best action within a limited time in high-dimensional ones. During interaction, the agent receives feedback from the environment, and this feedback can be used to prune the tree search and thus speed up KR-UCT's simulations in high-dimensional continuous action spaces. Based on this idea, this paper proposes a Monte Carlo tree search method built on a policy-value network, KR-UCT with Policy-Value Network (KRPV). The method uses a policy-value network to store the interaction information between the agent and the environment. The policy network then uses this information to help KR-UCT prune the width of the search tree, while the value network prunes the search in depth by generalizing value information across different states, thereby improving the simulation efficiency of KR-UCT and, in turn, its performance on high-dimensional continuous control tasks. KRPV is evaluated on four continuous control tasks from the OpenAI gym. Experimental results show that the proposed method outperforms KR-UCT on all four tasks; in particular, on the six-dimensional HalfCheetah-v2 task, the reward obtained by KRPV is six times that of KR-UCT.

Key words: High-dimensional continuous action space, Kernel regression UCT, Monte Carlo tree search, Reinforcement learning, Deep neural network

CLC number: TP181
[1]BROWNE C B,POWLEY E,WHITEHOUSE D,et al.A Survey of Monte Carlo Tree Search Methods[J].IEEE Transactions on Computational Intelligence & AI in Games,2012,4(1):1-43.
[2]SCHRITTWIESER J,ANTONOGLOU I,HUBERT T,et al.Mastering Atari,Go,Chess and Shogi by Planning with a Learned Model[J].arXiv:1911.08265.
[3]SILVER D,HUBERT T,SCHRITTWIESER J,et al.A general reinforcement learning algorithm that masters chess,shogi,and Go through self-play[J].Science,2018,362(6419):1140-1144.
[4]CHASLOT G,WINANDS M,VAN DEN HERIK H J,et al.Progressive strategies for monte-carlo tree search[J].New Mathematics and Natural Computation,2008,4(3):343-357.
[5]COUËTOUX A,MILONE M,et al.Continuous rapid action value estimates[C]//Asian Conference on Machine Learning.2011:19-31.
[6]YEE T,LISÝ V,BOWLING M.Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty[C]//IJCAI 2016.AAAI Press,2016:690-697.
[7]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540.
[8]COULOM R.Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search[C]//Proceedings of the International Conference on Computers and Games.2006:72-83.
[9]KOCSIS L,SZEPESVÁRI C.Bandit based monte-carlo planning[C]//17th European Conference on Machine Learning.2006:282-293.
[10]COUËTOUX A,HOOCK J B.Continuous Upper Confidence Trees[C]//International Conference on Learning and Intelligent Optimization.2011:433-445.
[11]BUBECK S,MUNOS R,STOLTZ G,et al.Online Optimization in X-Armed Bandits[J].Advances in Neural Information Processing Systems,2009:201-208.
[12]MANSLEY C R,WEINSTEIN A,LITTMAN M L.Sample-Based Planning for Continuous Action Markov Decision Processes[C]//International Conference on Automated Planning and Scheduling.AAAI Press,2011.
[13]WEINSTEIN A,LITTMAN M L.Bandit-based planning and learning in continuous-action markov decision processes[C]//Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling.2012:306-314.
[14]KIM B,LEE K,LIM S,et al.Monte Carlo Tree Search in Continuous Spaces Using Voronoi Optimistic Optimization with Regret Bounds [C]//AAAI 2020.2020:9916-9924.
[15]CHASLOT G M B,WINANDS M H,VAN DEN HERIK H J.Parallel monte-carlo tree search[C]//International Conference on Computers and Games.2008:60-71.
[16]KURZER K,HÖRTNAGL C,ZÖLLNER J M.Parallelization of Monte Carlo Tree Search in Continuous Domains[J].arXiv:2003.13741.
[17]ANTHONY T,TIAN Z,BARBER D.Thinking Fast and Slow with Deep Learning and Tree Search[C]//Neural Information Processing Systems.2017:5360-5370.
[18]KARTAL B,HERNANDEZ-LEAL P,TAYLOR M E.Action Guidance with MCTS for Deep Reinforcement Learning[J].Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment,2019,15(1):153-159.
[19]SUBRAMANIAN S G,CROWLEY M.Combining MCTS and A3C for Prediction of Spatially Spreading Processes in Forest Wildfire Settings[C]//Canadian Conference on Artificial Intelligence.2018:285-291.
[20]ZHANG H,CHENG F,XU B,et al.RevCuT Tree Search Method in Complex Single-player Game with Continuous Search Space[C]//2019 International Joint Conference on Neural Networks (IJCNN).2019:1-8.
[21]LEE K,KIM S,CHOI J,et al.Deep Reinforcement Learning in Continuous Action Spaces:a Case Study in the Game of Simulated Curling[C]//Proceedings of the 35th International Confe-rence on Machine Learning.2018:2937-2946.