Computer Science ›› 2020, Vol. 47 ›› Issue (12): 210-217. doi: 10.11896/jsjkx.191100084
LI Bin1, LIU Quan1,2,3,4
Abstract: Reinforcement learning is a research hotspot in artificial intelligence. When solving reinforcement learning problems, the traditional least-squares method, as a special class of function-approximation learning methods, offers fast convergence and makes full use of sample data. Building on a study and analysis of the least-squares temporal-difference algorithm (Least-Squares Temporal Difference, LSTD), this paper proposes the double-weights least-squares Sarsa algorithm (Double Weights with Least-Squares Sarsa, DWLS-Sarsa). DWLS-Sarsa associates two weight vectors in a prescribed way to obtain the target weights and uses the Sarsa method to control the temporal-difference error. During training, the two weights take different values because they are updated with different samples, which ensures that the algorithm can explore effectively; as the sample data accumulate, the gap between the two weights gradually shrinks until they converge to the same optimal value, which guarantees the algorithm's convergence. Finally, DWLS-Sarsa is compared experimentally with other reinforcement learning algorithms. The results show that DWLS-Sarsa achieves better learning performance and robustness, handles local-optimum problems effectively, and improves performance at convergence.
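The abstract only outlines DWLS-Sarsa at a high level. As a rough illustration of the double-weight idea — not the authors' exact algorithm — the following Python sketch maintains two linear weight vectors whose average forms the target weights, and updates one of the two per sample with the Sarsa TD error. The function name `dwls_sarsa_sketch`, the 0.5 mixing coefficient, and the random choice of which weight to update are all assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch of a double-weight Sarsa update with linear function
# approximation (NOT the paper's exact DWLS-Sarsa): two weight vectors w_a
# and w_b are combined into target weights; each sample updates only one of
# the two, so the weights differ during learning (supporting exploration)
# and drift toward each other as the TD error shrinks.
def dwls_sarsa_sketch(features, rewards, next_features,
                      alpha=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    w_a = np.zeros(d)
    w_b = np.zeros(d)
    for phi, r, phi_next in zip(features, rewards, next_features):
        w_t = 0.5 * (w_a + w_b)                    # target weights: combine the two
        td_err = r + gamma * phi_next @ w_t - phi @ w_t  # Sarsa TD error
        if rng.random() < 0.5:                     # update one weight per sample
            w_a += alpha * td_err * phi
        else:
            w_b += alpha * td_err * phi
    return w_a, w_b
```

On a toy single-feature problem with constant reward 1 and γ = 0.9, the average of the two weights converges deterministically to the fixed point 1/(1 − γ) = 10, while the gap between `w_a` and `w_b` stops growing once the TD error vanishes.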