Computer Science, 2014, Vol. 41, Issue 9: 232-238. doi: 10.11896/j.issn.1002-137X.2014.09.044

• Artificial Intelligence •

Batch Least-squares Policy Iteration

ZHOU Xin, LIU Quan, FU Qi-ming and XIAO Fei

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006;2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    National Natural Science Foundation of China (61070223,5,61070122,5,61303108), Natural Science Foundation of Jiangsu Province (BK2012616), Natural Science Research Project of Jiangsu Higher Education Institutions (09KJA520002,9KJB520012,3KJB520020), and the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

Abstract: Policy iteration is a reinforcement learning method that iteratively evaluates and improves a control policy. Evaluating the policy with a least-squares method extracts more useful information from the empirical data and improves data efficiency. However, the online least-squares policy iteration method underuses its samples: each sample is used only once and then discarded. To address this, a batch least-squares policy iteration (BLSPI) method was proposed and its convergence was proved in theory. BLSPI combines the batch-updating method with online least-squares policy iteration: the samples generated online are stored and repeatedly reused by the least-squares method to update the control policy. The BLSPI method was applied to the inverted pendulum platform, and the experimental results show that it can effectively utilize previous experience, improve the sample utilization rate, and accelerate convergence.

Key words: Reinforcement learning, Batch updating, Least-squares, Policy iteration
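
To make the batch-reuse idea in the abstract concrete, the following minimal Python sketch shows one way a BLSPI-style loop can be organized with linear function approximation. It is an illustration only: the feature map phi, the env.reset()/env.step() interface, and all hyperparameter values are assumptions of this sketch, not the paper's implementation.

    import numpy as np

    def lstdq(samples, phi, n_features, policy, gamma=0.95, reg=1e-3):
        """Least-squares policy evaluation (LSTD-Q): fit weights w so that
        Q(s, a) ~ phi(s, a) . w for the policy being evaluated."""
        A = reg * np.eye(n_features)  # small ridge term keeps A well conditioned
        b = np.zeros(n_features)
        for s, a, r, s_next, done in samples:
            phi_sa = phi(s, a)
            # Successor features follow the policy under evaluation.
            phi_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
            A += np.outer(phi_sa, phi_sa - gamma * phi_next)
            b += r * phi_sa
        return np.linalg.solve(A, b)

    def blspi(env, phi, n_features, n_actions,
              n_iterations=20, episodes_per_iter=10, gamma=0.95):
        """Batch least-squares policy iteration sketch: samples gathered
        online are stored, and the whole batch is reused at every step."""
        w = np.zeros(n_features)
        batch = []  # samples are kept here instead of being discarded after use

        def greedy(s):
            # Greedy policy w.r.t. the current weight vector w.
            return max(range(n_actions), key=lambda a: phi(s, a) @ w)

        for _ in range(n_iterations):
            for _ in range(episodes_per_iter):
                s, done = env.reset(), False  # assumed environment interface
                while not done:
                    a = greedy(s)  # exploration omitted for brevity
                    s_next, r, done = env.step(a)  # assumed to return a 3-tuple
                    batch.append((s, a, r, s_next, done))
                    s = s_next
            # Batch update: re-solve the least-squares system over ALL stored
            # samples; acting greedily w.r.t. the new w improves the policy.
            w = lstdq(batch, phi, n_features, greedy, gamma)
        return w

The point of contrast with the purely online variant is the growing batch list: each policy-evaluation step re-solves the least-squares system over all stored samples, so earlier experience keeps contributing to every new policy.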
