Computer Science, 2014, Vol. 41, Issue 9: 232-238. doi: 10.11896/j.issn.1002-137X.2014.09.044


Batch Least-squares Policy Iteration

ZHOU Xin, LIU Quan, FU Qi-ming and XIAO Fei

Online: 2018-11-14    Published: 2018-11-14

Abstract: Policy iteration is a reinforcement learning method that iteratively evaluates and improves the control policy. Evaluating the policy with least-squares methods extracts more useful information from the collected experience and makes better use of the data. Because the online least-squares policy iteration method uses each sample only once and therefore has a low sample-utilization rate, a batch least-squares policy iteration (BLSPI) method was proposed and its convergence was proved in theory. BLSPI combines online least-squares policy iteration with batch updating: it stores the samples generated online and repeatedly reuses them in least-squares updates of the control policy. The BLSPI method was applied to the inverted pendulum system, and the experimental results show that it can effectively exploit previous experience and knowledge, improve the sample-utilization rate, and accelerate convergence.

Key words: Reinforcement learning, Batch updating, Least-squares, Policy iteration
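
The abstract describes the BLSPI loop only at a high level: transitions generated online are stored, the whole stored batch is reused by a least-squares policy-evaluation step (in the style of LSTD-Q as used by LSPI), and the policy is then improved greedily. The Python sketch below illustrates one plausible reading of that loop; the feature map phi(s, a), the discrete action set, the reset/step environment interface and all hyper-parameters are placeholders chosen for illustration and are not taken from the paper.

# Minimal sketch of a BLSPI-style loop: least-squares (LSTD-Q) evaluation over a
# growing batch of stored samples, followed by greedy policy improvement.
# All interfaces and constants here are assumptions, not the authors' code.
import numpy as np

def lstdq(samples, phi, policy, n_features, gamma=0.95, reg=1e-3):
    # Least-squares policy evaluation over the stored batch of transitions.
    # samples: list of (s, a, r, s_next, done); phi(s, a) -> (n_features,) array.
    A = reg * np.eye(n_features)               # regularisation keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next, done in samples:
        f = phi(s, a)
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)               # weights of the approximate Q-function

def greedy_policy(w, phi, actions):
    # Policy improvement: act greedily with respect to the current Q-weights.
    return lambda s: max(actions, key=lambda a: float(phi(s, a) @ w))

def blspi(env, phi, actions, n_features,
          n_iterations=20, episodes_per_iter=10, max_steps=200):
    # Collect samples online, store them all, and reuse the entire batch in
    # every least-squares update (the "batch updating" of the abstract).
    samples, w = [], np.zeros(n_features)
    policy = greedy_policy(w, phi, actions)    # exploration is omitted for brevity
    for _ in range(n_iterations):
        for _ in range(episodes_per_iter):     # generate new samples online
            s = env.reset()
            for _ in range(max_steps):
                a = policy(s)
                s_next, r, done = env.step(a)  # assumed gym-style interface
                samples.append((s, a, r, s_next, done))
                s = s_next
                if done:
                    break
        w = lstdq(samples, phi, policy, n_features)    # reuse all stored samples
        policy = greedy_policy(w, phi, actions)        # improve the policy
    return policy, w

In this reading, reusing the entire stored batch at every iteration is what distinguishes BLSPI from the purely online variant, which discards each transition after a single update; on a task such as the inverted pendulum this repeated reuse is what raises the sample-utilization rate.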

