Computer Science, 2024, Vol. 51, Issue (11): 213-228. doi: 10.11896/jsjkx.231000037
• Artificial Intelligence •
YAO Tianlei, CHEN Xiliang, YU Peiyi