Computer Science ›› 2024, Vol. 51 ›› Issue (11): 213-228. doi: 10.11896/jsjkx.231000037

• Artificial Intelligence •


Review of Generative Reinforcement Learning Based on Sequence Modeling

YAO Tianlei, CHEN Xiliang, YU Peiyi   

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Received: 2023-10-07  Revised: 2024-03-15  Online: 2024-11-15  Published: 2024-11-06
  • Corresponding author: CHEN Xiliang (383618393@qq.com)
  • About author: YAO Tianlei, born in 2000, postgraduate (ytl0730@qq.com). His main research interest is deep reinforcement learning.
    CHEN Xiliang, born in 1985, Ph.D., associate professor. His main research interests include command information system engineering and deep reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (62273356).


Abstract: Reinforcement learning is the branch of machine learning concerned with learning to make decisions. It is a sequential decision-making problem in which an agent finds the optimal policy through repeated trial-and-error interaction with the environment. Reinforcement learning can be combined with generative models to optimize their performance; it is typically used to fine-tune generative models and improve their ability to create high-quality content. The reinforcement learning process can also be viewed as a general sequence modeling problem: the distribution over task trajectories is modeled, and a pre-trained generative model produces a sequence of actions that obtains high returns. By modeling the input information, generative reinforcement learning can better handle uncertain and unknown environments and more efficiently turn sequence data into policies for decision-making. This paper first introduces reinforcement learning algorithms and sequence modeling methods and analyzes how data sequences are modeled. It then surveys the state of reinforcement learning research, organized by the type of neural network model used. On this basis, it reviews methods that combine reinforcement learning with generative models and analyzes the application of reinforcement learning methods in generative pre-trained models. Finally, it summarizes the theoretical and applied development of the relevant technologies.
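To make the sequence-modeling view of reinforcement learning concrete, here is a minimal sketch in the spirit of return-conditioned models such as Decision Transformer: offline trajectories are flattened into interleaved (return-to-go, state, action) tokens, and a causal Transformer is trained to predict each action. All dimensions, module names, and the toy data below are illustrative assumptions, not the implementation of any particular surveyed method.

import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    """Causal Transformer over interleaved (return-to-go, state, action) tokens."""
    def __init__(self, state_dim, act_dim, embed_dim=64, n_layers=2, n_heads=4, max_len=20):
        super().__init__()
        self.embed_rtg = nn.Linear(1, embed_dim)           # return-to-go token
        self.embed_state = nn.Linear(state_dim, embed_dim)
        self.embed_action = nn.Linear(act_dim, embed_dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, 3 * max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(embed_dim, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B,T,1), states: (B,T,state_dim), actions: (B,T,act_dim)
        B, T, _ = states.shape
        triples = torch.stack([self.embed_rtg(rtg),
                               self.embed_state(states),
                               self.embed_action(actions)], dim=2)
        tokens = triples.reshape(B, 3 * T, -1) + self.pos_emb[:, :3 * T]
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.transformer(tokens, mask=mask)
        return self.predict_action(h[:, 1::3])             # action from each state token

# Toy supervised step on random "offline" trajectories (illustrative only).
model = ReturnConditionedPolicy(state_dim=4, act_dim=2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
rtg, states, actions = torch.rand(8, 10, 1), torch.randn(8, 10, 4), torch.randn(8, 10, 2)
loss = nn.functional.mse_loss(model(rtg, states, actions), actions)
opt.zero_grad(); loss.backward(); opt.step()

At evaluation time such a model is conditioned on a high target return and rolled out autoregressively, decrementing the return-to-go as rewards arrive; this is how "generating a sequence of actions to obtain high returns" is realized in practice.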

Key words: Artificial intelligence, Reinforcement learning, Neural network, Generative model, Sequence modeling
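The other direction discussed in the abstract, using reinforcement learning to fine-tune a generative model, reduces in its simplest form to a policy-gradient loop over sampled sequences scored by a reward signal. The sketch below is a hedged toy illustration: the tiny GRU language model, the 32-token vocabulary, and the hand-written reward_fn are assumptions standing in for a real pre-trained model and a learned human-preference reward model.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden = 32, 16, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, h=None):
        out, h = self.rnn(self.embed(tokens), h)
        return self.head(out), h

def reward_fn(seq):
    # Stand-in for a learned reward model: favor sequences of even tokens.
    return (seq % 2 == 0).float().mean(dim=1)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    tok = torch.zeros(8, 1, dtype=torch.long)            # batch of BOS tokens
    h, log_probs, seq = None, [], []
    for _ in range(12):                                  # sample 12-token sequences
        logits, h = model(tok, h)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        tok = dist.sample().unsqueeze(1)
        log_probs.append(dist.log_prob(tok.squeeze(1)))
        seq.append(tok)
    seq = torch.cat(seq, dim=1)
    advantage = reward_fn(seq) - reward_fn(seq).mean()   # baseline-subtracted reward
    # REINFORCE: raise the log-likelihood of sequences the reward model prefers.
    loss = -(torch.stack(log_probs, dim=1).sum(dim=1) * advantage).mean()
    opt.zero_grad(); loss.backward(); opt.step()

Production systems replace this plain REINFORCE update with methods such as PPO plus a KL penalty toward the pre-trained model, but the learning signal has the same shape.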

CLC Number: TP181