Computer Science ›› 2023, Vol. 50 ›› Issue (1): 262-269. doi: 10.11896/jsjkx.220700010

• Artificial Intelligence •

Sparse Reward Exploration Method Based on Trajectory Perception

ZHANG Qiyang, CHEN Xiliang, ZHANG Qiao   

  College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Received: 2022-07-01  Revised: 2022-08-11  Online: 2023-01-15  Published: 2023-01-09
  • About author: ZHANG Qiyang, born in 1998, postgraduate. His main research interests include deep reinforcement learning and knowledge transfer.
    CHEN Xiliang, born in 1985, Ph.D., associate professor. His main research interests include command information system engineering and deep reinforcement learning.
  • Supported by: National Natural Science Foundation of China (61806221).

Abstract: Existing deep reinforcement learning algorithms struggle with sparse reward problems: because they rely only on the pre-designed environment reward, exploration is difficult and good results are hard to achieve. In this setting, rewards must be designed more carefully, with more accurate judgment of and feedback on the agent's exploration status. The asynchronous advantage actor-critic (A3C) algorithm improves training efficiency and speed through parallel training, but it still cannot solve the hard-exploration problem in sparse reward environments. To address the poor exploration of A3C in such environments, this paper proposes A3C based on exploration trajectory perception (ETP-A3C). When exploration stalls during training, the algorithm perceives the agent's exploration trajectory, judges and decides the agent's exploration direction, and helps the agent escape the exploration dilemma as quickly as possible. To verify the effectiveness of ETP-A3C, comparative experiments against baseline algorithms are conducted in five different Super Mario Brothers environments. The results show that the proposed method significantly improves learning speed and model stability.
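The abstract does not spell out how ETP-A3C perceives trajectories, so the following is only a minimal sketch of the general idea: a per-worker monitor that watches a sliding window of recent states, flags a stall when the trajectory stops covering new ground, and adds an intrinsic bonus to the environment reward. All names and thresholds (TrajectoryPerceiver, window, stall_threshold, bonus_scale) are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of trajectory-perception-style reward shaping.
# Names and thresholds are illustrative inventions, not ETP-A3C's components.
from collections import deque

import numpy as np


class TrajectoryPerceiver:
    """Tracks a sliding window of recent states and flags exploration stalls."""

    def __init__(self, window=50, stall_threshold=1e-2, bonus_scale=0.1):
        self.states = deque(maxlen=window)      # recent trajectory window
        self.stall_threshold = stall_threshold  # spread below this = stuck
        self.bonus_scale = bonus_scale          # intrinsic reward weight

    def is_stalled(self):
        # Heuristic: if the recent states barely vary, the agent is likely
        # circling the same region of the state space.
        if len(self.states) < self.states.maxlen:
            return False
        spread = np.mean(np.std(np.stack(list(self.states)), axis=0))
        return spread < self.stall_threshold

    def shaped_reward(self, env_reward, state):
        state = np.asarray(state, dtype=np.float32)
        self.states.append(state)
        if self.is_stalled():
            # Reward movement away from the centre of the stuck region, so
            # actions that leave it score higher than actions that stay.
            mean_state = np.mean(np.stack(list(self.states)), axis=0)
            novelty = float(np.linalg.norm(state - mean_state))
            return env_reward + self.bonus_scale * novelty
        return env_reward
```

Under these assumptions, each A3C worker would replace the raw environment reward with perceiver.shaped_reward(reward, next_state) before computing advantages, leaving the rest of the A3C update unchanged.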

Key words: Artificial intelligence, Knowledge transfer, Deep reinforcement learning, Asynchronous advantage actor-critic, Exploration-exploitation problem

CLC Number: TP181