Computer Science ›› 2021, Vol. 48 ›› Issue (9): 235-243. doi: 10.11896/jsjkx.201000084
代珊珊1, 刘全1,2,3,4
DAI Shan-shan1, LIU Quan1,2,3,4
Abstract: With the development of artificial intelligence, research on autonomous driving has been growing rapidly, and deep reinforcement learning (DRL) is one of the field's main research methods. Within it, safe exploration is a research hotspot. However, to improve sample coverage, most DRL algorithms place no safety restrictions on exploration, so an unmanned vehicle can fall into dangerous states while exploring, causing learning to fail. To address this problem, this paper proposes a Constrained Soft Actor-Critic (CSAC) algorithm based on action constraints. The method first places a reasonable restriction on the environment reward: since an overly large steering angle makes the vehicle jitter, a penalty term is added to the reward function so that the vehicle avoids dangerous states as much as possible. CSAC additionally constrains the agent's actions: when an action chosen in the current state drives the vehicle off the track or into a collision, that action is marked as a constrained action, and these constraints are then used in later training to better guide the vehicle in selecting new actions. To demonstrate the advantages of CSAC, it is applied to the lane-keeping task in autonomous driving and compared with the SAC algorithm. The results show that CSAC, with its safety mechanism, effectively avoids unsafe actions, improves stability during autonomous driving, and also speeds up model training. Finally, the trained model is ported to a Raspberry Pi-based unmanned vehicle, further verifying its generality.
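The abstract describes two safety mechanisms: a steering-angle penalty folded into the reward, and a memory of constrained (unsafe) actions that guides later action selection. Below is a minimal Python sketch of both ideas, written only from the description above; all identifiers (shaped_reward, mark_unsafe, constrained_sample, the penalty coefficient and threshold, and the resampling scheme) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Sketch of the two CSAC safety mechanisms described in the abstract.
# Constants are assumed values, not taken from the paper.
STEER_PENALTY_COEF = 0.5   # weight of the steering-angle penalty (assumed)
STEER_THRESHOLD = 0.3      # angle beyond which jitter is penalized (assumed)

def shaped_reward(base_reward: float, steering_angle: float) -> float:
    """Reward shaping: penalize large steering angles to suppress jitter."""
    excess = max(abs(steering_angle) - STEER_THRESHOLD, 0.0)
    return base_reward - STEER_PENALTY_COEF * excess

# Action constraint: remember actions that led to leaving the track or a
# collision in a given state, and steer sampling away from them later.
unsafe_actions: dict = {}  # state key -> list of constrained actions

def mark_unsafe(state_key, action):
    """Record an action that caused the vehicle to derail or collide."""
    unsafe_actions.setdefault(state_key, []).append(np.asarray(action))

def constrained_sample(policy_sample, state_key, min_distance=0.1, max_tries=10):
    """Resample from the stochastic SAC policy until the action is far
    enough from every action previously marked unsafe in this state."""
    a = policy_sample()
    for _ in range(max_tries):
        bad = unsafe_actions.get(state_key, [])
        if all(np.linalg.norm(np.asarray(a) - b) >= min_distance for b in bad):
            return a
        a = policy_sample()
    return a  # fall back to the last sample if no safe action was found
```

In use, a caller would wrap the policy's sampling function, e.g. `constrained_sample(lambda: agent.sample(state), key(state))`, and call `mark_unsafe` whenever a transition ends in a derailment or collision; this is one plausible reading of "constrained action", not a claim about the authors' code.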