Computer Science ›› 2021, Vol. 48 ›› Issue (10): 30-36.doi: 10.11896/jsjkx.201000129

• Artificial Intelligence •

Monte Carlo Tree Search for High-dimensional Continuous Control Space

LIU Tian-xing, LI Wei, XU Zheng, ZHANG Li-hua, QI Xiao-ya, GAN Zhong-xue   

  1. Institute of AI and Robotics, Fudan University, Shanghai 200433, China
  2. Jihua Laboratory, Foshan, Guangdong 528000, China
  • Received: 2020-10-22  Revised: 2021-03-12  Online: 2021-10-15  Published: 2021-10-18
  • About author:LIU Tian-xing, born in 1996, postgraduate. His main research interests include reinforcement learning, machine learning and collective intelligence evolution.
    GAN Zhong-xue, born in 1952, Ph.D., distinguished professor. His main research interests include basic theory and key technologies of collective intelligence, autonomous intelligent robots and flexible automatic control.
  • Supported by:
    Jihua Laboratory Fund (X190021TB190) and Shanghai Committee of Science and Technology, China (19511132000).

Abstract: Monte Carlo tree search (MCTS) has achieved great success in discrete control tasks with low-dimensional action spaces. However, many real-world tasks require selecting actions sequentially in a continuous action space. Kernel regression UCT (KR-UCT) is a successful attempt in low-dimensional continuous action spaces: it uses a pre-defined kernel function to exploit the similarity between different continuous actions. However, KR-UCT performs poorly in high-dimensional continuous action spaces, because it does not make use of the information gathered from the agent's interaction with the environment, and at every step it needs a large number of simulations to find the best action. To address this problem, this paper proposes kernel regression UCT with a policy-value network (KRPV). The proposed method selects more representative actions from the action space for MCTS and generalizes information across different states to prune the search tree. KRPV is evaluated on four continuous control tasks from the OpenAI Gym. Experimental results show that KRPV outperforms KR-UCT on all tested continuous control tasks; in particular, on the six-dimensional HalfCheetah-v2 task, the reward obtained by KRPV is six times that of KR-UCT.
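The selection rule the abstract builds on can be made concrete. Below is a minimal, illustrative sketch of kernel-regression UCB action selection in the spirit of KR-UCT (ref. [6]), which KRPV extends with a policy-value network: the value and visit count of each candidate action are smoothed by a pre-defined kernel over the action space before a UCB-style score is applied. The Gaussian kernel, the `width` and `explore_c` parameters, and all function names here are illustrative assumptions, not taken from the paper's implementation.

```python
# Hedged sketch of kernel-regression UCB selection over continuous actions.
# Assumptions: Gaussian similarity kernel, hand-chosen width and exploration
# constant; this is not the authors' KRPV code.
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Pre-defined similarity kernel between two continuous actions."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * width ** 2))

def kr_uct_select(actions, visits, mean_values, explore_c=1.0, width=0.5):
    """Pick the index of the next child action to simulate.

    actions:      list of candidate actions (np.ndarray) at the current node
    visits:       visit count n(b) of each child
    mean_values:  empirical mean return v(b) of each child
    """
    n = len(actions)
    kernel = np.array([[gaussian_kernel(actions[i], actions[j], width)
                        for j in range(n)] for i in range(n)])
    visits = np.asarray(visits, dtype=float)
    values = np.asarray(mean_values, dtype=float)
    # W(a): kernel-weighted effective visit count of action a
    w = kernel @ visits
    # E[v|a]: kernel-regression estimate of the value of action a
    v = (kernel @ (visits * values)) / np.maximum(w, 1e-8)
    # UCB-style score: exploit the smoothed value, explore where density is low
    score = v + explore_c * np.sqrt(np.log(np.sum(w) + 1.0) / np.maximum(w, 1e-8))
    return int(np.argmax(score))

# Example usage on three two-dimensional candidate actions.
acts = [np.array([0.1, -0.3]), np.array([0.2, -0.25]), np.array([-0.8, 0.5])]
best = kr_uct_select(acts, visits=[4, 2, 1], mean_values=[0.6, 0.5, 0.1])
```

In KRPV, as described in the abstract, the candidate actions would not be sampled uniformly but proposed by a policy network, and a value network would supply state-value estimates that let the search prune branches without exhaustive simulation.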

Key words: Deep neural network, High-dimensional continuous action space, Kernel regression UCT, Monte Carlo tree search, Reinforcement learning

CLC Number: TP181

References:
[1]BROWNE C B,POWLEY E,WHITEHOUSE D,et al.A Survey of Monte Carlo Tree Search Methods[J].IEEE Transactions on Computational Intelligence & AI in Games,2012,4(1):1-43.
[2]SCHRITTWIESER J,ANTONOGLOU I,HUBERT T,et al.Mastering Atari,Go,Chess and Shogi by Planning with a Learned Model[J].arXiv:1911.08265.
[3]SILVER D,HUBERT T,SCHRITTWIESER J,et al.A general reinforcement learning algorithm that masters chess,shogi,and Go through self-play[J].Science,2018,362(6419):1140-1144.
[4]CHASLOT G,WINANDS M,VAN DEN HERIK H J,et al.Progressive strategies for monte-carlo tree search[J].New Mathematics and Natural Computation,2008,4(3):343-357.
[5]COUËTOUX A,MILONE M,et al.Continuous rapid action value estimates[C]//Asian Conference on Machine Learning.2011:19-31.
[6]YEE T,LISY V,BOWLING M.Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty[C]//IJCAI 2016.AAAI Press,2016:690-697.
[7]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540.
[8]COULOM R.Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search[C]//Proceedings of the International Conference on Computers and Games.2006:72-83.
[9]KOCSIS L,SZEPESVÁRI C.Bandit based monte-carlo planning[C]//17th European Conference on Machine Learning.2006:282-293.
[10]COUËTOUX A,HOOCK J B.Continuous Upper Confidence Trees[C]//International Conference on Learning and Intelligent Optimization.2011:433-445.
[11]BUBECK S,MUNOS R,STOLTZ G,et al.Online Optimization in X-Armed Bandits[J].Advances in Neural Information Processing Systems,2009:201-208.
[12]MANSLEY C R,WEINSTEIN A,LITTMAN M L.Sample-Based Planning for Continuous Action Markov Decision Processes[C]//International Conference on International Conference on Automated Planning & Scheduling.AAAI Press,2011.
[13]WEINSTEIN A,LITTMAN M L.Bandit-based planning and learning in continuous-action markov decision processes[C]//Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling.2012:306-314.
[14]KIM B,LEE K,LIM S,et al.Monte Carlo Tree Search in Continuous Spaces Using Voronoi Optimistic Optimization with Regret Bounds [C]//AAAI 2020.2020:9916-9924.
[15]CHASLOT G M B,WINANDS M H,VAN DEN HERIK H J.Parallel monte-carlo tree search[C]//International Conference on Computers and Games.2008:60-71.
[16]KURZER K,HÖRTNAGL C,ZÖLLNER J M.Parallelization of Monte Carlo Tree Search in Continuous Domains[J].arXiv:2003.13741.
[17]ANTHONY T,TIAN Z,BARBER D.Thinking Fast and Slow with Deep Learning and Tree Search[C]//Neural Information Processing Systems.2017:5360-5370.
[18]KARTAL B,HERNANDEZ-LEAL P,TAYLOR M E.Action Guidance with MCTS for Deep Reinforcement Learning[J].Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment,2019,15(1):153-159.
[19]SUBRAMANIAN S G,CROWLEY M.Combining MCTS and A3C for Prediction of Spatially Spreading Processes in Forest Wildfire Settings[C]//Canadian Conference on Artificial Intelligence.2018:285-291.
[20]ZHANG H,CHENG F,XU B,et al.RevCuT Tree Search Method in Complex Single-player Game with Continuous Search Space[C]//2019 International Joint Conference on Neural Networks (IJCNN).2019:1-8.
[21]LEE K,KIM S,CHOI J,et al.Deep Reinforcement Learning in Continuous Action Spaces:a Case Study in the Game of Simulated Curling[C]//Proceedings of the 35th International Conference on Machine Learning.2018:2937-2946.