Computer Science ›› 2020, Vol. 47 ›› Issue (3): 182-191. doi: 10.11896/jsjkx.190200352

• Artificial Intelligence •

Survey on Sparse Reward in Deep Reinforcement Learning

YANG Wei-yi1, BAI Chen-jia2, CAI Chao1, ZHAO Ying-nan2, LIU Peng2

  1. China Unicom Network Technology Research Institute, Beijing 100048, China;
    2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2019-02-24  Online: 2020-03-15  Published: 2020-03-30
  • About author: YANG Wei-yi, born in 1993, postgraduate. Her main research interests include machine learning, Internet of Things and reinforcement learning. BAI Chen-jia, born in 1993, Ph.D, is a member of China Computer Federation. His main research interests include reinforcement learning and neural networks.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61671175, 61672190).

Abstract: As an important research direction of machine learning, reinforcement learning is a class of methods that find the optimal policy by interacting with the environment. In recent years, deep learning has been widely combined with reinforcement learning algorithms, forming a new research field named deep reinforcement learning. As a new machine learning method, deep reinforcement learning can perceive complex inputs and learn optimal policies, and has been applied to robot control and other complex decision-making problems. The sparse reward problem is a core problem when reinforcement learning is applied to practical tasks, and it arises widely in real applications. Solving the sparse reward problem helps to improve sample efficiency and the quality of the learned policy, and promotes the application of deep reinforcement learning to practical tasks. This paper first gives an overview of the core algorithms of deep reinforcement learning, then introduces five classes of solutions to the sparse reward problem: reward design and learning, experience replay, exploration and exploitation, multi-goal learning, and auxiliary tasks. Finally, the related research is summarized and future directions are discussed.
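To make the multi-goal learning idea mentioned in the abstract concrete, the following minimal Python sketch illustrates hindsight goal relabeling in the spirit of Hindsight Experience Replay [80]. It is an illustrative example written for this page, not code from the surveyed papers; the environment, reward function, and data-structure names are assumptions chosen for the sketch.

```python
# Minimal sketch (not the authors' code) of hindsight goal relabeling [80]:
# an episode that never reaches the commanded goal still yields learning
# signal once its transitions are relabeled with a goal actually achieved.
import random
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple
    goal: tuple
    reward: float

def sparse_reward(achieved, goal):
    """Binary sparse reward: 1.0 only when the achieved state equals the goal."""
    return 1.0 if achieved == goal else 0.0

def relabel_episode(episode, k=4):
    """Augment an episode with k extra copies of each transition whose goal is
    replaced by a state actually reached later in the episode ("future" strategy)."""
    augmented = list(episode)
    for t, tr in enumerate(episode):
        future = episode[t:]
        for _ in range(k):
            new_goal = random.choice(future).next_state
            augmented.append(Transition(
                state=tr.state,
                action=tr.action,
                next_state=tr.next_state,
                goal=new_goal,
                reward=sparse_reward(tr.next_state, new_goal),
            ))
    return augmented

# Toy usage: a 1-D chain where the agent never reaches the original goal (5,),
# yet the relabeled copies contain transitions with non-zero reward.
episode = [Transition((i,), +1, (i + 1,), (5,), sparse_reward((i + 1,), (5,)))
           for i in range(3)]
replay = relabel_episode(episode)
print(sum(tr.reward > 0 for tr in replay), "relabeled transitions carry reward")
```

The same relabeled transitions would then be stored in an off-policy replay buffer and consumed by any goal-conditioned value or policy learner; the sketch only shows the relabeling step itself.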

Key words: Artificial intelligence, Deep learning, Deep reinforcement learning, Reinforcement learning, Sparse reward

CLC Number: TP181
[1]SUTTON R S,BARTO A G.Reinforcement learning:An introduction[M].MIT Press,US,2018.
[2]LECUN Y,BENGIO Y,HINTON G.Deep learning[J].Nature,2015,521(7553):436.
[3]LI Y.Deep reinforcement learning:An overview[J].arXiv: 1701.07274,2017.
[4]SILVER D,HUANG A,MADDISON C J,et al.Mastering the game of Go with deep neural networks and tree search[J].Nature,2016,529(7587):484.
[5]SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the game of go without human knowledge[J].Nature,2017,550(7676):354.
[6]SILVER D,HUBERT T,SCHRITTWIESER J,et al.A general reinforcement learning algorithm that masters chess,shogi,and Go through self-play[J].Science,2018,362(6419):1140-1144.
[7]PLAPPERT M,ANDRYCHOWICZ M,RAY A,et al.Multi-goal reinforcement learning:Challenging robotics environments and request for research[J].arXiv:1802.09464,2018.
[8]SCHAUL T,QUAN J,ANTONOGLOU I,et al.Prioritized experience replay[J].arXiv:1511.05952,2015.
[9]LEVINE S,PASTOR P,KRIZHEVSKY A,et al.Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection[J].The International Journal of Robotics Research,2018,37(4/5):421-436.
[10]ISELE D,RAHIMI R,COSGUN A,et al.Navigating occluded intersections with autonomous vehicles using deep reinforcement learning[C]∥2018 IEEE International Conference on Robotics and Automation (ICRA).IEEE,2018:2034-2039.
[11]BELLMAN R.A Markovian decision process[J].Journal of Mathematics and Mechanics,1957,6(5):679-684.
[12]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Playing Atari with deep reinforcement learning[J].arXiv:1312.5602,2013.
[13]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529.
[14]HASSELT H V.Double Q-learning[C]∥Advances in Neural Information Processing Systems.2010:2613-2621.
[15]HASSELT H V,GUEZ A,SILVER D.Deep reinforcement learning with double q-learning[C]∥Thirtieth AAAI Conference on Artificial Intelligence.2016.
[16]HORGAN D,QUAN J,BUDDEN D,et al.Distributed prioritized experience replay[C]∥International Conference on Learning Representations.2018.
[17]WANG Z,SCHAUL T,HESSEL M,et al.Dueling Network Architectures for Deep Reinforcement Learning[C]∥International Conference on Machine Learning.2016:1995-2003.
[18]BELLEMARE M G,DABNEY W,MUNOS R.A distributional perspective on reinforcement learning[C]∥International Conference on Machine Learning.2017:449-458.
[19]HESSEL M,MODAYIL J,VAN HASSELT H,et al.Rainbow:Combining improvements in deep reinforcement learning[C]∥Thirty-Second AAAI Conference on Artificial Intelligence.2018.
[20]DE ASIS K,HERNANDEZ-GARCIA J F,HOLLAND G Z, et al.Multi-step reinforcement learning:A unifying algorithm[C]∥Thirty-Second AAAI Conference on Artificial Intelligence.2018.
[21]FORTUNATO M,AZAR M G,PIOT B,et al.Noisy networks for exploration[C]∥International Conference on Learning Representations.2018.
[22]PRECUP D,SUTTON R S,DASGUPTA S.Off-policy temporal-difference learning with function approximation[C]∥International Conference on Machine Learning.2001:417-424.
[23]BROWNE C B,POWLEY E,WHITEHOUSE D,et al.A survey of Monte Carlo tree search methods[J].IEEE Transactions on Computational Intelligence and AI in Games,2012,4(1):1-43.
[24]SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]∥International Conference on Machine Learning.2014.
[25]MNIH V,BADIA A P,MIRZA M,et al.Asynchronous methods for deep reinforcement learning[C]∥International Conference on Machine Learning.2016:1928-1937.
[26]WYMANN B,ESPIÉ E,GUIONNEAU C,et al.Torcs,the open racing car simulator[J].Software,2000,4(6).
[27]TODOROV E,EREZ T,TASSA Y.MuJoCo:A physics engine for model-based control[C]∥2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.IEEE,2012:5026-5033.
[28]KEMPKA M,WYDMUCH M,RUNC G,et al.ViZDoom:A Doom-based AI research platform for visual reinforcement learning[C]∥2016 IEEE Conference on Computational Intelligence and Games (CIG).IEEE,2016:1-8.
[29]BEATTIE C,LEIBO J Z,TEPLYASHIN D,et al.DeepMind Lab[J].arXiv:1612.03801,2016.
[30]BABAEIZADEH M,FROSIO I,TYREE S,et al.Reinforcement learning through asynchronous advantage actor-critic on a GPU[C]∥International Conference on Learning Representations.2017.
[31]ESPEHOLT L,SOYER H,MUNOS R,et al.IMPALA:Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures[C]∥International Conference on Machine Learning.2018:1406-1415.
[32]SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust Region Policy Optimization[C]∥International Conference on Machine Learning.2015,37:1889-1897.
[33]WU Y,MANSIMOV E,GROSSE R B,et al.Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation[C]∥Advances in Neural Information Processing Systems.2017:5279-5288.
[34]SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347,2017.
[35]SCHULMAN J,MORITZ P,LEVINE S,et al.High-dimensional continuous control using generalized advantage estimation[C]∥International Conference on Learning Representations.2016.
[36]NACHUM O,NOROUZI M,XU K,et al.Bridging the gap between value and policy based reinforcement learning[C]∥Advances in Neural Information Processing Systems.2017:2775-2785.
[37]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[C]∥International Conference on Learning Representations.2016.
[38]FUJIMOTO S,HOOF H,MEGER D.Addressing Function Approximation Error in Actor-Critic Methods[C]∥International Conference on Machine Learning.2018:1582-1591.
[39]HAUSKNECHT M,STONE P.Deep reinforcement learning in parameterized action space[C]∥International Conference on Learning Representations.2016.
[40]STONE P.What’s hot at RoboCup[C]∥Thirtieth AAAI Conference on Artificial Intelligence.2016.
[41]HAARNOJA T,TANG H,ABBEEL P,et al.Reinforcement learning with deep energy-based policies[C]∥International Conference on Machine Learning.2017:1352-1361.
[42]HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft Actor-Critic:Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]∥International Conference on Machine Learning.2018:1856-1865.
[43]SCHULMAN J,CHEN X,ABBEEL P.Equivalence between policy gradients and soft q-learning[J].arXiv:1704.06440,2017.
[44]GU S,LILLICRAP T,GHAHRAMANI Z,et al.Q-prop:Sample-efficient policy gradient with an off-policy critic[C]∥International Conference on Learning Representations.2017.
[45]O’DONOGHUE B,MUNOS R,KAVUKCUOGLU K,et al. Combining policy gradient and Q-learning[C]∥International Conference on Learning Representations.2017.
[46]WANG Z,BAPST V,HEESS N,et al.Sample efficient actor-critic with experience replay[C]∥International Conference on Learning Representations.2017.
[47]ZHAO X Y,DING S F.Research on Deep Reinforcement Learning[J].Computer Science,2018,45(7):1-6.
[48]OPENAI.Faulty Reward Functions in the Wild[EB/OL].https://blog.openai.com/faulty-reward-functions.2017.
[49]RUSSELL S,NORVIG P.Artificial Intelligence:A Modern Approach (3rd ed.)[M].Hong Kong:Pearson Education Asia,2011.
[50]AMODEI D,OLAH C,STEINHARDT J,et al.Concrete Problems in AI Safety[J].arXiv:1606.06565,2016.
[51]NG A Y,RUSSELL S J.Algorithms for inverse reinforcement learning[C]∥ICML.2000,1:2.
[52]ZIEBART B D,MAAS A L,BAGNELL J A,et al.Maximum entropy inverse reinforcement learning[C]∥AAAI Conference on Artificial Intelligence.2008:1433-1438.
[53]AGHASADEGHI N,BRETL T.Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals[C]∥2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.IEEE,2011:1561-1566.
[54]FINN C,LEVINE S,ABBEEL P.Guided cost learning:Deep inverse optimal control via policy optimization[C]∥International Conference on Machine Learning.2016:49-58.
[55]HADFIELD-MENELL D,MILLI S,ABBEEL P,et al.Inverse reward design[C]∥Advances in Neural Information Processing Systems.2017:6765-6774.
[56]CHRISTIANO P F,LEIKE J,BROWN T,et al.Deep reinforcement learning from human preferences[C]∥Advances in Neural Information Processing Systems.2017:4299-4307.
[57]ZHANG K F,YU Y.Methodologies for Imitation Learning via Inverse Reinforcement Learning:A Review[J].Journal of Computer Research and Development,2019,56(2):254-261.
[58]HOU Y,LIU L,WEI Q,et al.A novel DDPG method with prioritized experience replay[C]∥2017 IEEE International Conference on Systems,Man,and Cybernetics (SMC).IEEE,2017:316-321.
[59]TAVAKOLI A,PARDO F,KORMUSHEV P.Action branching architectures for deep reinforcement learning[C]∥Thirty-Second AAAI Conference on Artificial Intelligence.2018.
[60]HORGAN D,QUAN J,BUDDEN D,et al.Distributed prioritized experience replay[C]∥International Conference on Learning Representations.2018.
[61]DE BRUIN T,KOBER J,TUYLS K,et al.Experience selection in deep reinforcement learning for control[J].The Journal of Machine Learning Research,2018,19(1):347-402.
[62]BAI C J,LIU P,ZHAO W,et al.Active Sampling for Deep Q-Learning Based on TD-error Adaptive Correction[J].Journal of Computer Research and Development,2019,56(2):262-280.
[63]CHAPELLE O,LI L.An empirical evaluation of Thompson sampling[C]∥Advances in Neural Information Processing Systems.2011:2249-2257.
[64]KOLTER J Z,NG A Y.Near-Bayesian exploration in polynomial time[C]∥Proceedings of the 26th Annual International Conference on Machine Learning.ACM,2009:513-520.
[65]OSBAND I,BLUNDELL C,PRITZEL A,et al.Deep exploration via bootstrapped DQN[C]∥Advances in Neural Information Processing Systems.2016:4026-4034.
[66]BELLEMARE M,SRINIVASAN S,OSTROVSKI G,et al.Unifying count-based exploration and intrinsic motivation[C]∥Advances in Neural Information Processing Systems.2016:1471-1479.
[67]OSTROVSKI G,BELLEMARE M G,VAN DEN OORD A,et al.Count-based exploration with neural density models[C]∥Proceedings of the 34th International Conference on Machine Learning.2017:2721-2730.
[68]VAN DEN OORD A,KALCHBRENNER N,KAVUKCUOGLU K.Pixel Recurrent Neural Networks[C]∥International Conference on Machine Learning.2016:1747-1756.
[69]SALIMANS T,KARPATHY A,CHEN X,et al.PixelCNN++:Improving the PixelCNN with discretized logistic mixture likelihood and other modifications[C]∥International Conference on Learning Representations (ICLR).2017.
[70]TANG H,HOUTHOOFT R,FOOTE D,et al.#Exploration:A study of count-based exploration for deep reinforcement learning[C]∥Advances in Neural Information Processing Systems.2017:2753-2762.
[71]HOUTHOOFT R,CHEN X,DUAN Y,et al.Vime:Variational information maximizing exploration[C]∥Advances in Neural Information Processing Systems.2016:1109-1117.
[72]STADIE B C,LEVINE S,ABBEEL P.Incentivizing exploration in reinforcement learning with deep predictive models[J].arXiv:1507.00814,2015.
[73]PATHAK D,AGRAWAL P,EFROS A A,et al.Curiosity-driven Exploration by Self-supervised Prediction[C]∥International Conference on Machine Learning.2017:2778-2787.
[74]BURDA Y,EDWARDS H,PATHAK D,et al.Large-scale study of curiosity-driven learning[C]∥International Conference on Learning Representations (ICLR).2019.
[75]BURDA Y,EDWARDS H,STORKEY A,et al.Exploration by random network distillation[C]∥International Conference on Learning Representations (ICLR).2019.
[76]FU J,CO-REYES J,LEVINE S.Ex2:Exploration with exemplar models for deep reinforcement learning[C]∥Advances in Neural Information Processing Systems.2017:2577-2587.
[77]OSBAND I,ASLANIDES J,CASSIRER A.Randomized prior functions for deep reinforcement learning[C]∥Advances in Neural Information Processing Systems.2018:8626-8638.
[78]CONTI E,MADHAVAN V,SUCH F P,et al.Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents[C]∥Advances in Neural Information Processing Systems.2018:5032-5043.
[79]GUPTA A,MENDONCA R,LIU Y X,et al.Meta-reinforcement learning of structured exploration strategies[C]∥Advances in Neural Information Processing Systems.2018:5307-5316.
[80]ANDRYCHOWICZ M,WOLSKI F,RAY A,et al.Hindsight experience replay[C]∥Advances in Neural Information Processing Systems.2017:5048-5058.
[81]SUTTON R S,MODAYIL J,DELP M,et al.Horde:A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction[C]∥The 10th International Conference on Autonomous Agents and Multiagent Systems.2011:761-768.
[82]SCHAUL T,HORGAN D,GREGOR K,et al.Universal value function approximators[C]∥International Conference on Machine Learning.2015:1312-1320.
[83]RAUBER P,UMMADISINGU A,MUTZ F,et al.Hindsight policy gradients[C]∥International Conference on Learning Representations (ICLR).2019.
[84]FANG M,ZHOU C,SHI B,et al.DHER:Hindsight Experience Replay for Dynamic Goals[C]∥International Conference on Learning Representations (ICLR).2019.
[85]LANKA S,WU T.ARCHER:Aggressive Rewards to Counter bias in Hindsight Experience Replay[J].arXiv:1809.02070,2018.
[86]NAIR A V,PONG V,DALAL M,et al.Visual reinforcement learning with imagined goals[C]∥Advances in Neural Information Processing Systems.2018:9209-9220.
[87]KINGMA D P,WELLING M.Auto-encoding variational bayes[J].arXiv:1312.6114,2013.
[88]SCHMIDHUBER J.Powerplay:Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem[J].Frontiers in Psychology,2013,4:313.
[89]FLORENSA C,HELD D,WULFMEIER M,et al.Reverse curriculum generation for reinforcement learning[C]∥Conference on Robot Learning.2017.
[90]FLORENSA C,HELD D,GENG X,et al.Automatic goal generation for reinforcement learning agents[C]∥International Conference on Machine Learning.2018:1514-1523.
[91]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[C]∥Advances in Neural Information Processing Systems.2014:2672-2680.
[92]SUKHBAATAR S,LIN Z,KOSTRIKOV I,et al.Intrinsic motivation and automatic curricula via asymmetric self-play[C]∥International Conference on Learning Representations (ICLR).2018.
[93]JADERBERG M,MNIH V,CZARNECKI W M,et al.Reinforcement learning with unsupervised auxiliary tasks[C]∥International Conference on Learning Representations (ICLR).2017.
[94]MIROWSKI P,PASCANU R,VIOLA F,et al.Learning to navigate in complex environments[C]∥International Conference on Learning Representations (ICLR).2017.
[95]MIROWSKI P,GRIMES M,MALINOWSKI M,et al.Learning to navigate in cities without a map[C]∥Advances in Neural Information Processing Systems.2018:2424-2435.
[96]PARISOTTO E,SALAKHUTDINOV R.Neural map:Structured memory for deep reinforcement learning[C]∥International Conference on Learning Representations.2018.
[97]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[98]GU S,LILLICRAP T,SUTSKEVER I,et al.Continuous deep q-learning with model-based acceleration[C]∥International Conference on Machine Learning.2016:2829-2838.
[99]XU Z,VAN HASSELT H P,SILVER D.Meta-gradient reinforcement learning[C]∥Advances in Neural Information Processing Systems.2018:2402-2413.
[100]NACHUM O,GU S S,LEE H,et al.Data-efficient hierarchical reinforcement learning[C]∥Advances in Neural Information Processing Systems.2018:3307-3317.
[101]TENENBAUM J.Building machines that learn and think like people[C]∥Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems.2018:5-5.