Computer Science, 2018, Vol. 45, Issue (7): 1-6. doi: 10.11896/j.issn.1002-137X.2018.07.001

• 5th CCF Academic Conference on Big Data •

Research on Deep Reinforcement Learning

ZHAO Xing-yu1, DING Shi-fei1,2

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China;
  2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2017-06-12  Online: 2018-07-30  Published: 2018-07-30
  • About the authors: ZHAO Xing-yu, born in 1994, male, master's candidate, CCF member; his main research interests include reinforcement learning and deep learning. DING Shi-fei, born in 1963, male, professor, Ph.D. supervisor, senior member of CCF; his main research interests include intelligent information processing, artificial intelligence and pattern recognition, machine learning and data mining, rough sets and soft computing, and granular computing. E-mail: dingsf@cumt.edu.cn (corresponding author).
  • Supported by: National Natural Science Foundation of China (61379101, 61672522) and National Key Basic Research Program of China (2013CB329502).

Abstract: As a new machine learning method, deep reinforcement learning combines deep learning with reinforcement learning, enabling an agent to perceive information from a high-dimensional space and to train a model and make decisions based on the information it receives. Because of its generality and effectiveness, deep reinforcement learning has been studied extensively and applied in many fields of daily life. This paper first gives an overview of deep reinforcement learning research and introduces its basic theory. It then presents value-based and policy-based deep reinforcement learning algorithms and discusses their application prospects. Finally, related research work is summarized and future directions are outlined.

Key words: Artificial intelligence, Deep learning, Deep reinforcement learning, Reinforcement learning
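
To make the distinction between the two algorithm families named in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper): it substitutes a linear function approximator for the deep network and uses plain NumPy, showing a one-step DQN-style temporal-difference update for the value-based family and a REINFORCE-style return-weighted update for the policy-based family. All variable names, dimensions, and hyperparameter values are placeholders chosen only for this example.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_actions = 4, 2      # toy dimensions standing in for a learned state encoding
    gamma, lr = 0.99, 0.01            # discount factor and step size (illustrative values)

    def softmax(logits):
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def q_update(W_q, s, a, r, s_next, done):
        """Value-based sketch: move Q(s,a) toward the target r + gamma * max_a' Q(s',a')."""
        target = r + (0.0 if done else gamma * np.max(W_q @ s_next))
        td_error = target - W_q[a] @ s
        W_q[a] += lr * td_error * s               # gradient step on the squared TD error

    def pg_update(W_pi, s, a, G):
        """Policy-based sketch: REINFORCE ascent along G * grad log pi(a|s) for a softmax policy."""
        p = softmax(W_pi @ s)
        grad_log = -np.outer(p, s)                # d log pi(a|s) / d W_pi for a softmax policy
        grad_log[a] += s
        W_pi += lr * G * grad_log                 # in-place update of the policy parameters

    # Toy usage on a single random transition
    W_q = rng.normal(scale=0.1, size=(n_actions, n_features))    # Q(s, a) = W_q[a] @ s
    W_pi = rng.normal(scale=0.1, size=(n_actions, n_features))   # policy logits = W_pi @ s
    s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
    q_update(W_q, s, a=1, r=1.0, s_next=s_next, done=False)
    pg_update(W_pi, s, a=0, G=1.0)
    print(W_q[1] @ s, softmax(W_pi @ s))

In an actual deep reinforcement learning system, W_q and W_pi would be deep neural networks and the updates would be performed by a gradient-based optimizer over mini-batches of experience, but the two update rules above capture the core of the value-based and policy-based families surveyed in the paper.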

CLC number: TP181