Computer Science ›› 2020, Vol. 47 ›› Issue (12): 210-217. doi: 10.11896/jsjkx.191100084

• Artificial Intelligence •


Double Weighted Learning Algorithm Based on Least Squares

LI Bin1, LIU Quan1,2,3,4   

  1. 1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
    4 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
  • Received:2019-11-11 Revised:2020-03-24 Published:2020-12-17
  • Corresponding author:LIU Quan (quanliu@suda.edu.cn)
  • About author:LI Bin,born in 1994,master candidate (2314073669@qq.com).His main research interests include reinforcement learning and deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include reinforcement learning,intelligent information processing and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355,61702055,61502323,61502329),Jiangsu Province Natural Science Research University Major Projects (18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University (93K172014K04,93K172017K18) and Suzhou Industrial Application of Basic Research Program Part (SYG201422).


Abstract: Reinforcement learning is one of the most active research topics in the field of artificial intelligence.The least-squares method is a special class of function approximation methods for solving reinforcement learning problems,with the advantages of a fast convergence rate and full utilization of sample data.Based on the study and analysis of the least-squares temporal difference algorithm (LSTD),this paper proposes a double weights with least-squares Sarsa algorithm (DWLS-Sarsa).DWLS-Sarsa relates the two weights in a certain way to obtain the target weight and uses the Sarsa method to control the temporal difference error.During the training process,the two weights take different values because they are updated with different samples,which ensures that the algorithm can explore effectively;the gap between the two weights also gradually narrows with the distribution of the sample data until they converge to the same optimal value,which guarantees the convergence of the algorithm.Finally,DWLS-Sarsa is compared experimentally with other reinforcement learning algorithms.The results show that DWLS-Sarsa can effectively handle local optimum problems,achieves a more precise convergence value,and has better learning performance and robustness.

Key words: Function approximation, Least-squares, Reinforcement learning, Sarsa, Temporal difference
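
The abstract describes the mechanism only at a high level. The short Python fragment below is a minimal illustrative sketch of one plausible reading: two weight vectors are maintained with LSTD-style least-squares statistics on alternating samples, and their average serves as the combined target weight that forms the Sarsa TD target. The class name DoubleWeightLSSarsa, the averaging rule, the alternating-sample scheme and the regularization parameter reg are assumptions made for illustration and are not the published DWLS-Sarsa equations.

    import numpy as np

    class DoubleWeightLSSarsa:
        """Illustrative sketch: double-weight least-squares Sarsa.

        Two weight vectors are fitted by separate LSTD-style accumulators
        (A_i, b_i) on alternating samples; their mean acts as the combined
        target weight. The concrete update rules are assumptions, not the
        paper's exact formulation.
        """

        def __init__(self, n_features, gamma=0.99, reg=1.0):
            self.gamma = gamma
            # Separate least-squares statistics for each of the two weights.
            self.A = [reg * np.eye(n_features) for _ in range(2)]
            self.b = [np.zeros(n_features) for _ in range(2)]
            self.w = [np.zeros(n_features) for _ in range(2)]
            self.t = 0  # step counter used to alternate which weight is updated

        def target_weight(self):
            # Combine the two weights into the target weight (here: their mean).
            return 0.5 * (self.w[0] + self.w[1])

        def q_value(self, phi_sa):
            # Action value of a state-action feature under the combined weight.
            return float(self.target_weight() @ phi_sa)

        def update(self, phi_sa, reward, phi_next_sa, done):
            # Sarsa bootstraps on the next action actually taken;
            # terminal transitions contribute no bootstrapped value.
            next_phi = np.zeros_like(phi_sa) if done else phi_next_sa
            i = self.t % 2  # alternate samples between the two weights
            self.t += 1
            # LSTD-style accumulation: A += phi (phi - gamma * phi')^T, b += r * phi.
            self.A[i] += np.outer(phi_sa, phi_sa - self.gamma * next_phi)
            self.b[i] += reward * phi_sa
            # Re-solve the least-squares system for the weight that received this sample.
            self.w[i] = np.linalg.solve(self.A[i], self.b[i])

Inside an epsilon-greedy Sarsa loop one would call update(phi(s, a), r, phi(s_next, a_next), done) after every transition and select actions by comparing q_value over the candidate action features; because each weight sees a different subset of samples, the two estimates differ early in training and coincide as their least-squares solutions converge.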

CLC Number: TP181