Computer Science ›› 2021, Vol. 48 ›› Issue (4): 223-228. doi: 10.11896/jsjkx.200600177

• Artificial Intelligence •

DQN Algorithm Based on Averaged Neural Network Parameters

HUANG Zhi-yong, WU Hao-lin, WANG Zhuang, LI Hui

  1. College of Computer Science, Sichuan University, Chengdu 610065, China
  • Received: 2020-06-24  Revised: 2020-09-14  Online: 2021-04-15  Published: 2021-04-09
  • Corresponding author: LI Hui (lihuib@scu.edu.cn)
  • About author: HUANG Zhi-yong, born in 1995, postgraduate. His main research interests include deep reinforcement learning. (huangzhiyong1995@163.com)
    LI Hui, born in 1970, Ph.D, professor. His main research interests include computational intelligence, battlefield simulation and virtual reality.
  • Supported by: Joint Foundation of the Ministry of Education (6141A02011607).

Abstract: In the field of deep reinforcement learning, exploring the environment efficiently is a hard problem. The Deep Q-Network (DQN) algorithm explores the environment with an ε-greedy policy, whose value of ε and decay schedule require manual tuning; unsuitable settings degrade performance. This exploration strategy is inefficient and cannot solve the deep exploration problem. To address the low exploration efficiency of DQN's ε-greedy policy, a DQN algorithm based on averaged neural network parameters (Averaged Parameters DQN, AP-DQN) is proposed. At the beginning of each episode, the algorithm averages several online value-network parameter sets previously learned by the agent to obtain a perturbed set of network parameters, and then selects actions through the perturbed network, which improves the agent's exploration efficiency. Experimental results show that AP-DQN explores more efficiently than DQN on deep exploration problems and obtains higher average per-episode rewards than DQN in five Atari games; its normalized score improves on DQN's by at most 112.50% and at least 19.07%.

Key words: Deep exploration, Deep Q-network, Deep reinforcement learning, Neural network parameters
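The core mechanism described in the abstract, averaging several recent online-network parameter sets at the start of an episode and acting greedily through the resulting perturbed network, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and method names are invented here, and a single linear layer stands in for the deep Q-network.

```python
from collections import deque
import numpy as np

class AveragedParameterExplorer:
    """Keeps the last k online-network parameter snapshots and averages
    them into a perturbed parameter set used for action selection
    (a sketch of the AP-DQN idea; names are illustrative)."""

    def __init__(self, k):
        # Only the k most recent snapshots are retained.
        self.snapshots = deque(maxlen=k)

    def record(self, params):
        # params: dict mapping layer name -> weight array; store a copy.
        self.snapshots.append({name: w.copy() for name, w in params.items()})

    def perturbed_params(self):
        # Element-wise mean over the stored snapshots, per layer.
        assert self.snapshots, "no snapshots recorded yet"
        return {name: np.mean([s[name] for s in self.snapshots], axis=0)
                for name in self.snapshots[0]}

def greedy_action(params, state):
    # Q(s, .) = W s + b for a single linear layer; pick the argmax action.
    q = params["W"] @ state + params["b"]
    return int(np.argmax(q))

# Usage: record a snapshot after each episode's learning updates, then
# act through the averaged parameters at the start of the next episode.
explorer = AveragedParameterExplorer(k=2)
explorer.record({"W": np.array([[1.0, 0.0], [0.0, 1.0]]), "b": np.zeros(2)})
explorer.record({"W": np.array([[3.0, 0.0], [0.0, 3.0]]), "b": np.zeros(2)})
perturbed = explorer.perturbed_params()
action = greedy_action(perturbed, np.array([0.0, 1.0]))
```

In a real DQN the snapshots would be copies of the network's weights taken at successive training checkpoints; averaging them yields a network that differs slightly from the current online network, which is what perturbs action selection.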

CLC number: TP181