Computer Science ›› 2022, Vol. 49 ›› Issue (1): 298-305.doi: 10.11896/jsjkx.201100194

• Artificial Intelligence •

Upper Confidence Bound Exploration with Fast Convergence

AO Tian-yu1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
  • Received: 2020-11-26  Revised: 2021-03-02  Online: 2022-01-15  Published: 2022-01-18
  • About author: AO Tian-yu, born in 1995, postgraduate. His main research interests include deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D., professor, is a member of the China Computer Federation. His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program Part (SYG201422) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Deep reinforcement learning methods have achieved excellent results in large state-space control tasks, and exploration has long been a research hotspot in this field. Existing exploration algorithms suffer from problems such as blind exploration and slow learning. To address these problems, an upper confidence bound exploration with fast convergence (FAST-UCB) method is proposed. The method uses the UCB approach to explore the environment and improve exploration efficiency. To alleviate the overestimation of Q values and balance exploration and exploitation, a Q-value clipping technique is added. Then, to balance the bias and variance of the algorithm and let the agent learn quickly, a long short-term memory (LSTM) unit is added to the network model, and an improved mixed Monte Carlo method is used to compute the network error. Finally, FAST-UCB is applied to the deep Q-network and compared with the epsilon-greedy and UCB algorithms in control environments to verify its effectiveness. In addition, the proposed algorithm is compared with noisy network exploration, bootstrapped exploration, the asynchronous advantage actor-critic algorithm and the proximal policy optimization algorithm in the Atari 2600 environment to verify its generalization. Experimental results show that the FAST-UCB algorithm achieves excellent results in both environments.
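The abstract names three ingredients of FAST-UCB: UCB-style exploration, clipped Q values, and a mixed Monte Carlo error term. The short Python sketch below illustrates one plausible form of each in isolation; the function names, constants and exact formulas are illustrative assumptions, not the authors' published formulation.

import numpy as np

# Minimal sketch of the three ingredients named in the abstract: a UCB-style
# action-selection rule, a clipped Q-learning target, and a mixed Monte Carlo
# target. All names and constants here are illustrative assumptions.

def ucb_action(q_values, visit_counts, total_steps, c=1.0):
    # Choose the action maximizing Q(s, a) plus a count-based confidence bonus,
    # so rarely tried actions are preferred over purely greedy choices.
    bonus = c * np.sqrt(np.log(total_steps + 1) / (visit_counts + 1e-8))
    return int(np.argmax(q_values + bonus))

def clipped_q_target(reward, next_q1, next_q2, gamma=0.99, done=False):
    # Bootstrap from the smaller of two Q estimates (in the spirit of
    # double / clipped Q-learning) to reduce overestimation of Q values.
    if done:
        return reward
    return reward + gamma * min(np.max(next_q1), np.max(next_q2))

def mixed_mc_target(mc_return, td_target, beta=0.1):
    # Mixed Monte Carlo target: interpolate between the full Monte Carlo
    # return and the bootstrapped TD target to trade off bias and variance.
    return beta * mc_return + (1.0 - beta) * td_target

In a DQN-style agent, such targets would replace the standard one-step target when computing the network loss, while the LSTM layer mentioned in the abstract would sit between the state encoder and the Q-value head.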

Key words: Clipped Q value, Exploration, Long short-term memory, Mixed Monte Carlo, Upper confidence bound

CLC Number: TP181