Computer Science ›› 2024, Vol. 51 ›› Issue (11): 81-94. doi: 10.11896/jsjkx.231000170

• Database & Big Data & Data Science •

Advantage Weighted Double Actors-Critics Algorithm Based on Key-Minor Architecture for Policy Distillation

YANG Haolin1, LIU Quan1,2   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received:2023-10-24 Revised:2024-03-07 Online:2024-11-15 Published:2024-11-06
  • About author: YANG Haolin, born in 1999, postgraduate, is a member of CCF (No. J1794G). His main research interests include offline reinforcement learning and deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No. 15231S). His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (62376179, 61772355, 61702055, 61876217, 62176175), Natural Science Foundation of Xinjiang Uygur Autonomous Region, China (2022D01A238), and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Abstract: Offline reinforcement learning (offline RL) addresses the task of learning from a fixed batch of data, which avoids the risks of interacting with the environment and improves the efficiency and stability of learning. The advantage weighted actor-critic algorithm, which combines sample-efficient dynamic programming with maximum-likelihood policy updates, makes use of large amounts of offline data and supports fast online fine-tuning of the policy. However, that algorithm relies on a uniform random experience replay mechanism, its actor-critic model employs only a single actor, and its data sampling and replay are unbalanced. To address these problems, an advantage weighted double actors-critics algorithm based on policy distillation with data experience optimization and replay (DOR-PDAWAC) is proposed. It adopts a mechanism that prefers new data while repeatedly replaying both new and old data, uses double actors to increase exploration, and applies a key-minor architecture for policy distillation that divides the actors into a key actor and a minor actor, thereby improving performance and efficiency. The proposed algorithm is applied to MuJoCo tasks from the widely used D4RL benchmark, and experimental results show that it achieves better performance in terms of learning efficiency and other aspects.
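To make the abstract's mechanisms concrete, the following is a minimal Python sketch (not the authors' released implementation) of the two ingredients the method builds on: an AWAC-style advantage-weighted maximum-likelihood policy update, and a replay buffer that prefers newer transitions while still replaying old ones. The actor/critic interfaces, the temperature beta, and the recency_bias parameter are illustrative assumptions.

import numpy as np
import torch

def awac_policy_loss(actor, critic, states, actions, beta=1.0):
    # Advantage-weighted maximum-likelihood update (AWAC-style):
    # the actor is pulled toward batch actions in proportion to
    # exp(A(s, a) / beta), so high-advantage offline actions dominate.
    with torch.no_grad():
        q = critic.q_value(states, actions)   # Q(s, a)  (assumed interface)
        v = critic.value(states)              # V(s), e.g. E_a'[Q(s, a')]
        weights = torch.exp((q - v) / beta).clamp(max=100.0)  # stability clip
    log_prob = actor.log_prob(states, actions)  # log pi(a|s) (assumed interface)
    return -(weights * log_prob).mean()

class RecencyPreferringBuffer:
    # Replay buffer that samples newer transitions with higher
    # probability while keeping older ones reachable, echoing the
    # "prefer new data, replay old and new repeatedly" mechanism.
    def __init__(self, capacity, recency_bias=2.0):
        self.capacity = capacity
        self.bias = recency_bias
        self.data = []

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)                  # drop the oldest transition
        self.data.append(transition)

    def sample(self, batch_size):
        n = len(self.data)
        p = np.arange(1, n + 1, dtype=np.float64) ** self.bias
        p /= p.sum()                          # newer index -> larger probability
        idx = np.random.choice(n, size=batch_size, p=p)
        return [self.data[i] for i in idx]

The double-actor and key-minor distillation components are not sketched here; per the abstract, the actors are split into a key actor and a minor actor, with distillation between them used to improve performance and efficiency.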

Key words: Offline reinforcement learning, Deep reinforcement learning, Policy distillation, Double actors-critics framework, Experience replay mechanism

CLC Number: TP181