Computer Science, 2014, Vol. 41, Issue 9: 232-238. doi: 10.11896/j.issn.1002-137X.2014.09.044


Batch Least-squares Policy Iteration

ZHOU Xin, LIU Quan, FU Qi-ming and XIAO Fei

Online: 2018-11-14    Published: 2018-11-14

Abstract: Policy iteration is a reinforcement learning method that iteratively evaluates and improves the control policy. Evaluating the policy with least-squares methods extracts more useful information from the collected experience and makes better use of the data. Because the online least-squares policy iteration method uses each sample only once and therefore has a low sample-utilization rate, a batch least-squares policy iteration (BLSPI) method was proposed and its convergence was proved in theory. BLSPI combines online least-squares policy iteration with batch updating: it stores the samples generated online and repeatedly reuses them in least-squares updates of the control policy. The BLSPI method was applied to the inverted pendulum system, and the experimental results show that it can effectively exploit previous experience and knowledge, improve the sample-utilization rate, and accelerate convergence.

Key words: Reinforcement learning, Batch updating, Least-squares, Policy iteration
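
The abstract describes the BLSPI loop only at a high level: transitions generated online are stored, the whole stored batch is reused by a least-squares policy-evaluation step (in the style of LSTD-Q as used by LSPI), and the policy is then improved greedily. The Python sketch below illustrates one plausible reading of that loop; the feature map phi(s, a), the discrete action set, the reset/step environment interface and all hyper-parameters are placeholders chosen for illustration and are not taken from the paper.

# Minimal sketch of a BLSPI-style loop: least-squares (LSTD-Q) evaluation over a
# growing batch of stored samples, followed by greedy policy improvement.
# All interfaces and constants here are assumptions, not the authors' code.
import numpy as np

def lstdq(samples, phi, policy, n_features, gamma=0.95, reg=1e-3):
    # Least-squares policy evaluation over the stored batch of transitions.
    # samples: list of (s, a, r, s_next, done); phi(s, a) -> (n_features,) array.
    A = reg * np.eye(n_features)               # regularisation keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next, done in samples:
        f = phi(s, a)
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)               # weights of the approximate Q-function

def greedy_policy(w, phi, actions):
    # Policy improvement: act greedily with respect to the current Q-weights.
    return lambda s: max(actions, key=lambda a: float(phi(s, a) @ w))

def blspi(env, phi, actions, n_features,
          n_iterations=20, episodes_per_iter=10, max_steps=200):
    # Collect samples online, store them all, and reuse the entire batch in
    # every least-squares update (the "batch updating" of the abstract).
    samples, w = [], np.zeros(n_features)
    policy = greedy_policy(w, phi, actions)    # exploration is omitted for brevity
    for _ in range(n_iterations):
        for _ in range(episodes_per_iter):     # generate new samples online
            s = env.reset()
            for _ in range(max_steps):
                a = policy(s)
                s_next, r, done = env.step(a)  # assumed gym-style interface
                samples.append((s, a, r, s_next, done))
                s = s_next
                if done:
                    break
        w = lstdq(samples, phi, policy, n_features)    # reuse all stored samples
        policy = greedy_policy(w, phi, actions)        # improve the policy
    return policy, w

In this reading, reusing the entire stored batch at every iteration is what distinguishes BLSPI from the purely online variant, which discards each transition after a single update; on a task such as the inverted pendulum this repeated reuse is what raises the sample-utilization rate.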

