Computer Science ›› 2020, Vol. 47 ›› Issue (12): 210-217.doi: 10.11896/jsjkx.191100084


Double Weighted Learning Algorithm Based on Least Squares

LI Bin1, LIU Quan1,2,3,4   

  1. School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2. Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3. Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
    4. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
  • Received:2019-11-11 Revised:2020-03-24 Published:2020-12-17
  • About author:LI Bin,born in 1994,master candidate.His main research interests include reinforcement learning and deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include reinforcement learning,intelligent information processing and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355,61702055,61502323,61502329),Jiangsu Province Natural Science Research University Major Projects (18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University (93K172014K04,93K172017K18) and partially by Suzhou Industrial Application of Basic Research Program (SYG201422).

Abstract: Reinforcement learning is one of the most challenging and difficult concerns in the field of artificial intelligence.The least-squares method is an advanced function approximation method that can be used to solve reinforcement learning problems,with the advantages of a fast convergence rate and sufficient utilization of sample data.After studying and analyzing the least-squares temporal difference algorithm (LSTD),this paper proposes a double-weight least-squares Sarsa algorithm (DWLS-Sarsa) based on LSTD.DWLS-Sarsa combines two weights in a certain way and controls the temporal difference error with the Sarsa method.During training,the two weights take different values because they are updated on different samples,and the gap between them gradually narrows until they converge to the same optimal value due to the distribution of the sample data,which ensures both the exploration performance and the convergence of the algorithm.Finally,the DWLS-Sarsa algorithm is applied to experiments and compared with other reinforcement learning algorithms.The experimental results show that DWLS-Sarsa can deal with local optimum problems effectively,achieve a more precise convergence value,and has better learning performance and robustness.
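The abstract describes DWLS-Sarsa only at a high level, so the sketch below is one plausible reading rather than the authors' exact method: it assumes linear function approximation over state-action features, two independent LSTD-style accumulators whose updates alternate at random (so each weight is fit on a different subset of the samples), an averaged combination of the two weights for action selection, and epsilon-greedy exploration. The class name DoubleLSSarsa, the feature construction, and all parameter values are hypothetical.

import numpy as np

class DoubleLSSarsa:
    """Minimal double-weight least-squares Sarsa sketch (a hypothetical
    reading of the abstract, not the authors' exact DWLS-Sarsa)."""

    def __init__(self, n_features, n_actions, gamma=0.99, epsilon=0.1, reg=1e-3):
        self.n_features = n_features
        self.n_actions = n_actions
        self.gamma = gamma
        self.epsilon = epsilon
        dim = n_features * n_actions
        # Two independent LSTD accumulators; reg * I keeps A invertible early on.
        self.A = [reg * np.eye(dim) for _ in range(2)]
        self.b = [np.zeros(dim) for _ in range(2)]
        self.w = [np.zeros(dim) for _ in range(2)]

    def phi(self, s_feats, a):
        # State-action features: state features copied into the block of action a.
        x = np.zeros(self.n_features * self.n_actions)
        x[a * self.n_features:(a + 1) * self.n_features] = s_feats
        return x

    def q(self, s_feats, a):
        # Combine the two weights by averaging -- one simple choice for the
        # "certain way" of combination mentioned in the abstract.
        return self.phi(s_feats, a) @ (0.5 * (self.w[0] + self.w[1]))

    def act(self, s_feats, rng):
        # Epsilon-greedy over the combined action values.
        if rng.random() < self.epsilon:
            return int(rng.integers(self.n_actions))
        return int(np.argmax([self.q(s_feats, a) for a in range(self.n_actions)]))

    def update(self, s_feats, a, r, s2_feats, a2, done, rng):
        # Sarsa-style LSTD accumulation: the next feature vector uses the action
        # actually chosen by the policy. A random coin picks which of the two
        # weights absorbs this sample, so the weights see different sample sets.
        i = int(rng.integers(2))
        x = self.phi(s_feats, a)
        x2 = np.zeros_like(x) if done else self.phi(s2_feats, a2)
        self.A[i] += np.outer(x, x - self.gamma * x2)
        self.b[i] += r * x
        self.w[i] = np.linalg.solve(self.A[i], self.b[i])

# Tiny usage sketch on synthetic transitions (the environment is a stand-in):
rng = np.random.default_rng(0)
agent = DoubleLSSarsa(n_features=4, n_actions=2)
s = rng.random(4)
a = agent.act(s, rng)
for _ in range(200):
    s2, r = rng.random(4), float(rng.random())   # fake next state and reward
    a2 = agent.act(s2, rng)
    agent.update(s, a, r, s2, a2, done=False, rng=rng)
    s, a = s2, a2

Under this reading, each weight is the least-squares solution over its own subset of transitions; since both subsets are drawn from the same sample distribution, the two solutions drift together as data accumulates, matching the convergence behavior the abstract describes.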

Key words: Function approximation, Least-squares, Reinforcement learning, Sarsa, Temporal difference

CLC Number: TP181
[1] MOERLAND T M,BROEKENS J,JONKER C M.Emotion in reinforcement learning agents and robots:a survey[J].Machine Learning,2018,107(2):443-480.
[2] LIU T,TIAN B,CAO D,et al.Parallel Reinforcement Learning:A Framework and Case Study[J].IEEE/CAA Journal of Automatica Sinica,2018,5(4):65-73.
[3] DU W,DING S F.Overview on Multi-agent Reinforcement Learning[J].Computer Science,2019,46(8):1-8.
[4] ZHAO X Y,DING S F.Research on Deep Reinforcement Learning[J].Computer Science,2018,45(7):1-6.
[5] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[6] SUTTON R S,BARTO A.Reinforcement Learning:An Introduction[M].MIT Press,2019.
[7] DEGRIS T,PILARSKI P M,SUTTON R S.Model-free reinforcement learning with continuous action in practice[C]//Proceedings of 2012 American Control Conference.Montreal,QC,Canada,2012:2177-2182.
[8] NEDIĆ A,BERTSEKAS D.Convergence Rate of Incremental Subgradient Algorithms[J].Stochastic Optimization:Algorithms and Applications,2001,54:223.
[9] LI L,WILLIAMS J D,BALAKRISHNAN S.Reinforcement learning for dialog management using least-squares policy iteration and fast feature selection[C]//Proceedings of the 10th Annual Conference of the International Speech Communication Association.Brighton,UK,2009.
[10] WOOKEY D S,KONIDARIS G D.Regularized feature selection in reinforcement learning[J].Machine Learning,2015,100(2/3):655-676.
[11] LAGOUDAKIS M G,PARR R.Least-Squares Policy Iteration[J].Journal of Machine Learning Research,2004,4(6):1107-1149.
[12] JUNG T,POLANI D.Least squares SVM for least squares TD learning[C]//Proceedings of the 17th European Conference on Artificial Intelligence.Riva del Garda,Italy,2006.
[13] WANG J K,LIN S D.Parallel Least-Squares Policy Iteration[C]//2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA).2016:166-173.
[14] ZHOU X,LIU Q,FU Q M,et al.Batch Least-squares Policy Iteration[J].Computer Science,2014,41(9):232-238.
[15] GEORGE J A,SHALABH B.An online prediction algorithm for reinforcement learning with linear function approximation using cross entropy method[J].Machine Learning,2018,107(8/9/10):1385-1429.
[16] GEIST M,PIETQUIN O.Parametric value function approximation:A unified view[C]//Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.Piscataway,USA,2011.
[17] BUSONIU L,DE BRUIN T,TOLIC D,et al.Reinforcement Learning for Control:Performance,Stability,and Deep Approximators[J].Annual Reviews in Control,2018,46:8-28.
[18] JIN Y J,ZHU W W,FU Y C,et al.Actor-Critic Algorithm Based on Tile Coding and Model Learning[J].Computer Science,2014,41(6):239-242,249.
[19] VAN SEIJEN H,MAHMOOD A R,PILARSKI P M,et al.True Online Temporal-Difference Learning[J].Journal of Machine Learning Research,2015,17(1):5057-5096.
[20] GRONDMAN I,BUSONIU L,LOPES G A D,et al.A Survey of Actor-Critic Reinforcement Learning:Standard and Natural Policy Gradients[J].IEEE Transactions on Systems,Man,and Cybernetics,Part C (Applications and Reviews),2012,42(6):1291-1307.
[21] GHORBANI F,DERHAMI V,AFSHARCHI M.Fuzzy Least Square Policy Iteration and Its Mathematical Analysis[J].International Journal of Fuzzy Systems,2017,19(3):849-862.