Computer Science ›› 2019, Vol. 46 ›› Issue (10): 265-272. doi: 10.11896/jsjkx.180901655

• Artificial Intelligence •

Reinforcement Learning Algorithm Based on Generative Adversarial Networks

CHEN Jian-ping1,2,3, ZOU Feng1,2,3, LIU Quan4, WU Hong-jie1,2,3, HU Fu-yuan1,2,3, FU Qi-ming1,2,3

  1. (Institute of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China)
  2. (Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China)
  3. (Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China)
  4. (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215009, China)
  • Received: 2018-09-05 Revised: 2018-11-24 Online: 2019-10-15 Published: 2019-10-21
  • Corresponding author: FU Qi-ming (born 1985), male, Ph.D, lecturer; his main research interests include reinforcement learning, pattern recognition and building energy efficiency. E-mail: fqm_1@126.com.
  • About the authors: CHEN Jian-ping (born 1963), male, professor, master's supervisor; his main research interests include building energy efficiency and intelligent information processing. ZOU Feng (born 1993), male, master's candidate; his main research interests include reinforcement learning and building energy efficiency. LIU Quan (born 1969), male, professor, Ph.D supervisor; his main research interests include reinforcement learning and intelligent information processing. WU Hong-jie (born 1977), male, associate professor, CCF member; his main research interests include deep learning, pattern recognition and bioinformatics. HU Fu-yuan (born 1978), male, professor; his main research interests include image processing, pattern recognition and machine learning.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61502329, 61772357, 61750110519, 61772355, 61702055, 61672371, 61602334, 61472267), the Natural Science Foundation of Jiangsu Province (13KJB520020), the Key Research and Development Program of Jiangsu Province (BE2017663), the Natural Science Research Program of Jiangsu Higher Education Institutions (13KJB520020), the 13th Five-Year Plan Provincial Key Discipline Program (20168765), the Aeronautical Science Foundation of China (20151996016) and the Suzhou Applied Basic Research Program (Industry) (SYG201422).

Abstract: To address the slow learning rate caused by the lack of experience samples at the early stage of training in most traditional reinforcement learning algorithms, this paper proposed a reinforcement learning algorithm based on generative adversarial networks (GANs). At the early stage, the algorithm collects a small number of experience samples with a stochastic policy to construct a real sample pool, and uses the collected samples to train a GAN. The GAN then generates new samples to construct a virtual sample pool, and batches of training samples are drawn from the combined real and virtual pools to train the value function network, thereby improving the learning rate. Moreover, the algorithm introduces a rectified relationship unit, which uses a deep neural network to model the internal relationship between the state-action pairs and the subsequent states and rewards in the real sample pool, and feeds the relative entropy back to the GAN to improve the quality of the generated samples. Finally, the proposed algorithm and the DQN algorithm were applied to the classic CartPole and MountainCar problems on the OpenAI Gym platform. The experimental results show that, compared with DQN, the proposed algorithm effectively accelerates learning at the early stage of training and shortens the convergence time by 15%.
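
The batch-selection step described in the abstract can be illustrated with a minimal Python sketch. The paper publishes no code, so the names here (MixedReplayBuffer, mix_ratio) and the transition layout are illustrative assumptions, not the authors' implementation:

import random

class MixedReplayBuffer:
    """Draws training batches from a real sample pool (collected by a
    stochastic policy) and a virtual sample pool (generated by the GAN)."""

    def __init__(self, mix_ratio=0.5):
        self.real_pool = []         # (s, a, r, s') tuples from the environment
        self.virtual_pool = []      # (s, a, r, s') tuples from the generator
        self.mix_ratio = mix_ratio  # fraction of each batch drawn from the real pool

    def add_real(self, transition):
        self.real_pool.append(transition)

    def add_virtual(self, transition):
        self.virtual_pool.append(transition)

    def sample(self, batch_size):
        # Take as many real samples as the ratio allows, then top the batch
        # up with generated samples; early in training the real pool is
        # small, so virtual samples dominate the batch.
        n_real = min(int(batch_size * self.mix_ratio), len(self.real_pool))
        n_virtual = min(batch_size - n_real, len(self.virtual_pool))
        batch = random.sample(self.real_pool, n_real)
        batch += random.sample(self.virtual_pool, n_virtual)
        random.shuffle(batch)
        return batch

Under this reading, the virtual pool compensates for the scarcity of real experience at the start of training, which is where the abstract attributes the speed-up over plain DQN.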

Key words: Deep learning, Experience samples, Generative adversarial networks, Reinforcement learning

CLC Number: TP391
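
The abstract's second mechanism, the rectified relationship unit, can be read as a supervised model of the environment dynamics whose disagreement with generated transitions is fed back to the GAN as a relative-entropy (KL divergence) term. The following PyTorch sketch shows one way to realize that reading; the network shape, the softmax normalization and the function names are assumptions, since the paper's exact formulation is not reproduced on this page:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RectifiedRelationshipUnit(nn.Module):
    """Learns the mapping (state, action) -> (next state, reward) from the
    real sample pool, so GAN-generated transitions can be scored against
    the real environment dynamics."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim + 1),  # predicts s' and r jointly
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def relative_entropy_penalty(rru, state, action, gen_next_state, gen_reward):
    """KL divergence between the unit's prediction and a generated (s', r).
    Adding this term to the generator loss pushes the GAN toward transitions
    consistent with the learned dynamics (a hypothetical wiring of the
    feedback loop, not the authors' exact loss)."""
    predicted = rru(state, action)
    generated = torch.cat([gen_next_state, gen_reward], dim=-1)
    # Normalize both vectors into distributions before computing the KL term.
    return F.kl_div(F.log_softmax(generated, dim=-1),
                    F.softmax(predicted, dim=-1),
                    reduction="batchmean")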