Computer Science ›› 2021, Vol. 48 ›› Issue (12): 297-303.doi: 10.11896/jsjkx.201000163

• Artificial Intelligence •

Proximal Policy Optimization Based on Self-directed Action Selection

SHEN Yi1, LIU Quan1,2,3,4   

    1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
  • Received:2020-10-28 Revised:2021-03-11 Online:2021-12-15 Published:2021-11-26
  • About author:SHEN Yi,born in 1995,postgraduate.Her main research interests include deep reinforcement learning and so on.
    LIU Quan,born in 1969,Ph.D,professor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61502323,61502329),Jiangsu Province Natural Science Research University Major Projects(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Industrial Application of Basic Research Program Part(SYG201422) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Monotonic policy improvement in reinforcement learning is a current research hotspot, and such algorithms have achieved good performance on both discrete and continuous control tasks. Proximal policy optimization (PPO) is a classic monotonic policy improvement algorithm, but it is an on-policy algorithm with low sample utilization. To solve this problem, an algorithm named proximal policy optimization based on self-directed action selection (SDAS-PPO) is proposed. Besides reusing sample experience according to importance sampling weights, SDAS-PPO adds a synchronously updated experience pool that stores its own excellent sample experience, and uses a self-directed network learned from this pool to guide action selection. SDAS-PPO greatly improves sample utilization and ensures that the agent learns quickly and effectively while training the network model. To verify the effectiveness of SDAS-PPO, it is compared with the TRPO, PPO and PPO-AMBER algorithms on continuous control tasks of the MuJoCo simulation platform. Experimental results show that the proposed method performs better in most environments.
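
As a reference point for the approach summarized above, the following is a minimal PyTorch-style sketch of the clipped surrogate objective of PPO [17], which SDAS-PPO extends. The function name ppo_clipped_loss, the clip_eps parameter and the toy mini-batch are illustrative assumptions rather than the authors' implementation; the self-directed network and the synchronously updated experience pool are omitted.

    # Illustrative sketch only: the clipped surrogate loss of PPO [17],
    # which SDAS-PPO extends with a self-directed network and an
    # experience pool (both omitted here). Names are assumptions.
    import torch

    def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # Importance-sampling ratio between the current policy and the
        # policy that collected the data.
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Pessimistic bound: element-wise minimum, negated for gradient descent.
        return -torch.min(unclipped, clipped).mean()

    # Toy usage with random tensors standing in for a sampled mini-batch.
    old_lp = torch.randn(64)
    new_lp = old_lp + 0.1 * torch.randn(64)
    adv = torch.randn(64)
    loss = ppo_clipped_loss(new_lp, old_lp, adv)

In SDAS-PPO, action selection during sampling is additionally guided by the self-directed network learned from the experience pool, which this sketch does not model.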

Key words: Deep reinforcement learning, Policy gradient, Proximal policy optimization, Reinforcement learning, Self-directed

CLC Number: 

  • TP181
[1]SUTTON R S,BARTO A G.Reinforcement Learning:An Introduction[M].Cambridge,MA:MIT Press,1998:6-22.
[2]PARR R,LI L,TAYLOR G,et al.An Analysis of Linear Models,Linear Value-Function Approximation,and Feature Selection for Reinforcement Learning[C]//International Conference on Machine Learning.2008.
[3]KOHL N,STONE P.Policy gradient reinforcement learning for fast quadrupedal locomotion[C]//IEEE International Conference on Robotics & Automation.IEEE,2004.
[4]BARTO A G,SUTTON R S,ANDERSON C W.Neuronlike adaptive elements that can solve difficult learning control problems[J].IEEE Transactions on Systems,Man and Cybernetics,1983,13(5):834-846.
[5]SEIJEN H V,HASSELT H V,WHITESON S,et al.A theoretical and empirical analysis of Expected Sarsa[C]//Adaptive Dynamic Programming and Reinforcement Learning.IEEE,2009.
[6]KIUMARSI B,LEWIS F L,MODARES H,et al.Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics[J].Automatica,2014,50(4):1167-1175.
[7]TANGKARATT V,ABDOLMALEKI A,SUGIYAMA M. Guide Actor-Critic for Continuous Control[J].arXiv:1705.07606,2017.
[8]KRIZHEVSKY A,SUTSKEVER I,HINTON G.ImageNet Classification with Deep Convolutional Neural Networks[J].Advances in Neural Information Processing Systems,2012,25:1097-1105.
[9]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[10]LIU Q,ZHAI J W,ZHANG Z Z,et al.A review of deep reinforcement learning[J].Chinese Journal of Computers,2018,41(1):1-27.
[11]WANG Z,SCHAUL T,HESSEL M,et al.Dueling network architectures for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:692-700.
[12]VAN HASSELT H,GUEZ A,SILVER D.Deep Reinforcement Learning with Double Q-Learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.Phoenix,USA,2016:2094-2100.
[13]HAUSKNECHT M,STONE P.Deep recurrent Q-learning for partially observable MDPs[C]//2015 AAAI Fall Symposium Series.2015.
[14]SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]//Proceedings of the 31st International Conference on Machine Learning.New York:ACM,2014:387-395.
[15]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[16]SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust region policy optimization[C]//International Conference on Machine Learning.PMLR,2015:1889-1897.
[17]SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347,2017.
[18]HAN S,SUNG Y.Amber:Adaptive multi-batch experience replay for continuous action control[J].arXiv:1710.04423,2017.
[19]LIU H,FENG Y,MAO Y,et al.Sample-efficient policy optimization with stein control variate[J].arXiv:1710.11198,2017.
[20]LING P,CAI Q P,HUANG L B.Multi-Path Policy Optimization[C]//International Conference on Autonomous Agents and Multi Agent Systems.2020:1001-1009.
[21]PAN F,CAI Q,ZENG A X,et al.Policy optimization with model-based explorations[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019,33:4675-4682.
[22]TOUATI A,ZHANG A,PINEAU J,et al.Stable policy optimization via off-policy divergence regularization[C]//Conference on Uncertainty in Artificial Intelligence.PMLR,2020:1328-1337.
[23]LI A,FLORENSA C,CLAVERA I,et al.Sub-policy Adaptation for Hierarchical Reinforcement Learning[C]//International Conference on Learning Representations.2019.
[24]YOSHIDA N,UCHIBE E,DOYA K.Reinforcement learning with state-dependent discount factor[C]//IEEE Third Joint International Conference on Development & Learning & Epigenetic Robotics.IEEE,2013.
[25]FU Q M,LIU Q,SUN H K,et al.A second-order TD Error fast Q(λ) algorithm[J].Pattern Recognition and Artificial Intelligence,2013(3):282-292.
[26]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540,2016.
[27]TODOROV E,EREZ T,TASSA Y.MuJoCo:A physics engine for model-based control[C]//2012 IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS).IEEE,2012.
[28]DHARIWAL P,HESSE N,MANNING C,et al.OpenAI baselines [OL].GitHub,2017.https://github.com/openai/baselines.