Computer Science ›› 2021, Vol. 48 ›› Issue (10): 30-36. doi: 10.11896/jsjkx.201000129
LIU Tian-xing, LI Wei, XU Zheng, ZHANG Li-hua, QI Xiao-ya, GAN Zhong-xue
Abstract: Monte Carlo Tree Search (MCTS) has achieved great success in low-dimensional discrete control tasks. Many real-world tasks, however, require planning in continuous action spaces. Because a continuous action space contains an intractably large set of candidate actions, MCTS struggles to identify the best action within a limited time budget. As a variant of MCTS, the Kernel Regression UCT (KR-UCT) algorithm improves the simulation efficiency of MCTS in low-dimensional continuous action spaces by generalizing local information through a kernel function. However, while interacting with the environment, KR-UCT must run a large number of simulations from scratch at every step to find the best action, which confines it to low-dimensional continuous action spaces; in high-dimensional continuous action spaces it cannot select the best action within a limited time. Since the agent receives feedback from the environment during interaction, this feedback can be used to prune the tree search and thus accelerate KR-UCT's simulations in high-dimensional continuous action spaces. Based on this idea, this paper proposes a Monte Carlo tree search method built on a policy-value network, KR-UCT with Policy-Value Network (KRPV). The method uses the policy-value network to store the interaction information between the agent and the environment; the policy network then uses this information to prune the KR-UCT search tree in width, while the value network prunes it in depth by generalizing value information across different states. This improves the simulation efficiency of KR-UCT and hence its performance on high-dimensional continuous action tasks. KRPV is evaluated on four continuous control tasks in OpenAI Gym. Experimental results show that KRPV outperforms KR-UCT on all four tasks; in particular, on the 6-dimensional HalfCheetah-v2 task, the reward obtained by KRPV is six times that of KR-UCT.
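The two pruning ideas in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the one-dimensional toy dynamics, the `policy_candidates` and `value_estimate` stand-ins for the trained policy and value networks, the Gaussian kernel bandwidth, and the exploration constant are all assumptions made for the example.

```python
# Sketch of the KRPV idea: KR-UCT action selection with kernel regression,
# plus width pruning (policy proposes few candidates) and depth pruning
# (value estimate replaces long rollouts). All specifics are illustrative.
import math
import random

def kernel(a, b, bandwidth=0.5):
    """Gaussian kernel: KR-UCT shares statistics between nearby actions."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def kr_ucb_score(action, stats, c=1.4):
    """Kernel-regression UCB: value and visit mass smoothed over neighbours."""
    w = sum(kernel(action, b) * n for b, (n, _) in stats.items())
    v = sum(kernel(action, b) * n * q for b, (n, q) in stats.items()) / max(w, 1e-9)
    total = sum(n for n, _ in stats.values())
    return v + c * math.sqrt(math.log(total + 1) / (w + 1e-9))

def policy_candidates(state, k=5):
    """Stand-in for the policy network: propose k candidates (width pruning)."""
    return [random.gauss(-state, 0.3) for _ in range(k)]

def value_estimate(state):
    """Stand-in for the value network: score a state directly (depth pruning)."""
    return -abs(state)  # toy objective: drive the state toward 0

def krpv_search(state, n_sims=200):
    """One planning step: simulate, then pick the best-valued candidate."""
    candidates = policy_candidates(state)
    stats = {a: (1, value_estimate(state + a)) for a in candidates}
    for _ in range(n_sims):
        a = max(candidates, key=lambda x: kr_ucb_score(x, stats))
        next_state = state + a            # toy deterministic dynamics
        q = value_estimate(next_state)    # leaf scored by the value stand-in
        n, old_q = stats[a]
        stats[a] = (n + 1, old_q + (q - old_q) / (n + 1))
    return max(candidates, key=lambda x: stats[x][1])
```

In this toy setting, `krpv_search(1.0)` returns an action close to -1.0, i.e. the candidate that best cancels the state. The point of the sketch is the division of labour: the policy stand-in keeps the candidate set small instead of covering the whole continuous space, and the value stand-in terminates simulations at depth one instead of rolling out to the end of an episode.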
[1] BROWNE C B,POWLEY E,WHITEHOUSE D,et al.A Survey of Monte Carlo Tree Search Methods[J].IEEE Transactions on Computational Intelligence & AI in Games,2012,4(1):1-43.
[2] SCHRITTWIESER J,ANTONOGLOU I,HUBERT T,et al.Mastering Atari,Go,Chess and Shogi by Planning with a Learned Model[J].arXiv:1911.08265.
[3] SILVER D,HUBERT T,SCHRITTWIESER J,et al.A general reinforcement learning algorithm that masters chess,shogi,and Go through self-play[J].Science,2018,362(6419):1140-1144.
[4] CHASLOT G,WINANDS M,VAN DEN HERIK H J,et al.Progressive strategies for monte-carlo tree search[J].New Mathematics and Natural Computation,2008,4(3):343-357.
[5] COUËTOUX A,MARIO M.Continuous rapid action value estimates[C]//Asian Conference on Machine Learning.2011:19-31.
[6] YEE T,VILAM L,BOWLING M.Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty[C]//IJCAI 2016.AAAI Press,2016:690-697.
[7] BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540.
[8] RÉMI C.Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search[C]//Proceedings of the International Conference on Computers and Games.2006:72-83.
[9] KOCSIS L,SZEPESVÁRI C.Bandit based monte-carlo planning[C]//17th European Conference on Machine Learning.2006:282-293.
[10] COUËTOUX A,HOOCK J B.Continuous Upper Confidence Trees[C]//International Conference on Learning and Intelligent Optimization.2011:433-445.
[11] SÉBASTIEN B,RÉMI M,STOLTZ G,et al.Online Optimization in X-Armed Bandits[J].Advances in Neural Information Processing Systems,2009:201-208.
[12] MANSLEY C R,WEINSTEIN A,LITTMAN M L.Sample-Based Planning for Continuous Action Markov Decision Processes[C]//International Conference on Automated Planning and Scheduling.AAAI Press,2011.
[13] WEINSTEIN A,LITTMAN M L.Bandit-based planning and learning in continuous-action markov decision processes[C]//Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling.2012:306-314.
[14] KIM B,LEE K,LIM S,et al.Monte Carlo Tree Search in Continuous Spaces Using Voronoi Optimistic Optimization with Regret Bounds[C]//AAAI 2020.2020:9916-9924.
[15] CHASLOT G M B,WINANDS M H,VAN DEN HERIK H J.Parallel monte-carlo tree search[C]//International Conference on Computers and Games.2008:60-71.
[16] KURZER K,CHRISTOPH H,MARIUS Z J.Parallelization of Monte Carlo Tree Search in Continuous Domains[J].arXiv:2003.13741.
[17] ANTHONY T,TIAN Z,BARBER D.Thinking Fast and Slow with Deep Learning and Tree Search[C]//Neural Information Processing Systems.2017:5360-5370.
[18] KARTAL B,HERNANDEZ-LEAL P,TAYLOR M E.Action Guidance with MCTS for Deep Reinforcement Learning[J].Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment,2019,15(1):153-159.
[19] SUBRAMANIAN S G,CROWLEY M.Combining MCTS and A3C for Prediction of Spatially Spreading Processes in Forest Wildfire Settings[C]//Canadian Conference on Artificial Intelligence.2018:285-291.
[20] ZHANG H,CHENG F,XU B,et al.RevCuT Tree Search Method in Complex Single-player Game with Continuous Search Space[C]//2019 International Joint Conference on Neural Networks (IJCNN).2019:1-8.
[21] LEE K,KIM S,CHOI J,et al.Deep Reinforcement Learning in Continuous Action Spaces:a Case Study in the Game of Simulated Curling[C]//Proceedings of the 35th International Conference on Machine Learning.2018:2937-2946.