Computer Science ›› 2014, Vol. 41 ›› Issue (9): 232-238. doi: 10.11896/j.issn.1002-137X.2014.09.044
周鑫,刘全,傅启明,肖飞
ZHOU Xin, LIU Quan, FU Qi-ming and XIAO Fei
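Abstract: Policy iteration is a reinforcement learning method that iteratively evaluates and improves a control policy. Least-squares policy evaluation extracts more useful information from experience data and so improves data efficiency. Online least-squares policy iteration makes poor use of its samples, discarding each one after a single use. To address this, a batch least-squares policy iteration algorithm (BLSPI) is proposed, and its convergence is proved theoretically. BLSPI combines batch updating with online least-squares policy iteration: it stores the samples generated online and reuses them repeatedly, together with the least-squares method, to update the control policy. Applied to an inverted pendulum experimental platform, BLSPI is shown experimentally to make effective use of prior experience, raise experience utilization, and speed up convergence.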
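The idea of storing online samples and reusing them in repeated least-squares updates can be illustrated with a minimal sketch. It is not the authors' BLSPI implementation: the four-state chain task, the one-hot features, the regularization term `delta`, the ε-greedy exploration, and all names (`step`, `phi`, `lstdq`, `batch_lspi`) are assumptions made for illustration, and the inverted pendulum is replaced by a toy problem so the example stays short.

```python
import random

N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9   # hypothetical toy problem, not the pendulum

def step(state, action):
    """Toy chain: action 0 moves left, 1 moves right; reward 1 on landing in the last state."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def phi(state, action):
    """One-hot feature vector over (state, action) pairs (an assumed basis)."""
    vec = [0.0] * (N_STATES * N_ACTIONS)
    vec[state * N_ACTIONS + action] = 1.0
    return vec

def greedy(w, state):
    """Greedy action under the linear Q estimate; ties go to the lowest action index."""
    return max(range(N_ACTIONS), key=lambda a: w[state * N_ACTIONS + a])

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lstdq(samples, w, delta=1e-6):
    """Least-squares evaluation of the greedy policy of w, using every stored sample."""
    k = N_STATES * N_ACTIONS
    A = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for s, a, r, s2 in samples:
        f, f2 = phi(s, a), phi(s2, greedy(w, s2))
        for i in range(k):
            if f[i]:
                b[i] += f[i] * r
                for j in range(k):
                    A[i][j] += f[i] * (f[j] - GAMMA * f2[j])
    return solve(A, b)

def batch_lspi(steps=600, batch=100, max_sweeps=10, epsilon=0.2, seed=0):
    """Collect samples online, keep them all, and re-solve over the whole buffer."""
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * N_ACTIONS)
    buffer, state = [], 0
    for t in range(1, steps + 1):
        explore = rng.random() < epsilon
        action = rng.randrange(N_ACTIONS) if explore else greedy(w, state)
        nxt, reward = step(state, action)
        buffer.append((state, action, reward, nxt))   # samples are stored, not discarded
        state = nxt
        if t % batch == 0:
            # Batch update: reuse the full buffer for several evaluate/improve sweeps.
            for _ in range(max_sweeps):
                old = [greedy(w, s) for s in range(N_STATES)]
                w = lstdq(buffer, w)
                if [greedy(w, s) for s in range(N_STATES)] == old:
                    break
    return w, [greedy(w, s) for s in range(N_STATES)]
```

Calling `batch_lspi()` returns the weight vector and the greedy action for each state. In this sketch the contrast with a purely online variant lies in `buffer`: each transition contributes to every later least-squares solve rather than to a single update. The batch size, sweep limit, and exploration rate are arbitrary illustrative choices.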