Computer Science ›› 2025, Vol. 52 ›› Issue (12): 231-238. doi: 10.11896/jsjkx.250100094

• Artificial Intelligence •

A Memory-augmented Method for Long Text Modeling

LIU Weijie, TANG Zecheng, LI Juntao   

  1. Artificial Intelligence Laboratory, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215031, China
  • Received: 2025-01-14  Revised: 2025-04-28  Published: 2025-12-15  Online: 2025-12-09
  • Corresponding author: LI Juntao (ljt@suda.edu.cn)
  • About author: LIU Weijie (20224227039@stu.suda.edu.cn)

MemLong: Memory-augmented Retrieval for Long Text Modeling

LIU Weijie, TANG Zecheng, LI Juntao   

  1. Artificial Intelligence Laboratory, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215031, China
  • Received: 2025-01-14  Revised: 2025-04-28  Published: 2025-12-15  Online: 2025-12-09
  • About author: LIU Weijie, born in 1999, postgraduate. His main research interests include RAG, long-context language models, NLP, and LLMs.
    LI Juntao, born in 1992, Ph.D., associate professor. His main research interest is natural language generation.

Abstract: Large language models (LLMs) have made remarkable progress in recent years and demonstrated outstanding performance across many domains. However, handling long-text tasks remains a major challenge for LLMs, owing to the quadratic time and space complexity of the attention mechanism and the growing GPU-memory consumption of the key-value cache during generation. To address this problem, this paper proposes MemLong, a memory-augmented method for long-text modeling that enhances long-context language modeling by using an external retriever to access historical information. MemLong combines a non-parametric retrieval-memory module with a partially trainable large language model, and introduces a fine-grained, controllable retrieval attention mechanism that exploits semantically relevant text chunks. The non-parametric retrieval-memory module retrieves historical information relevant to the current input from an external knowledge base, while the large language model fuses the retrieved information with the current input to generate the output. The fine-grained, controllable retrieval attention mechanism allows the model to dynamically adjust how much attention it pays to the retrieved information during generation, yielding more accurate text generation. Comprehensive evaluations on multiple long-context language modeling benchmarks show that MemLong consistently outperforms other state-of-the-art LLMs. In addition, MemLong significantly improves the model's ability to handle long texts: on a single 3090 GPU, MemLong extends the context length from 4 000 to 80 000 tokens, a 20-fold increase. This breakthrough enables MemLong to process longer inputs and thus better understand and generate long-form content, offering new possibilities for ultra-long-text tasks and opening new directions for future research on long-text language modeling.
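To make the retrieval-memory idea above more concrete, the following is a minimal sketch, assuming a chunk-level memory that stores embeddings outside the model and returns the most semantically similar chunks for the current input. The class name ChunkMemory, its add/retrieve methods, and the top_k parameter are illustrative assumptions rather than MemLong's actual interface.

```python
# Illustrative sketch of a non-parametric retrieval-memory module: chunk
# embeddings are kept outside the model and the most similar chunks are
# returned for the current input. Names (ChunkMemory, add, retrieve, top_k)
# are assumptions for exposition, not MemLong's actual API.
import numpy as np

class ChunkMemory:
    def __init__(self, dim: int):
        self.embeddings = np.empty((0, dim), dtype=np.float32)  # one row per stored chunk
        self.chunks = []                                          # the raw text chunks

    def add(self, chunk_text: str, chunk_embedding: np.ndarray) -> None:
        """Store a text chunk together with its L2-normalized embedding."""
        emb = chunk_embedding / (np.linalg.norm(chunk_embedding) + 1e-8)
        self.embeddings = np.vstack([self.embeddings, emb[None, :]])
        self.chunks.append(chunk_text)

    def retrieve(self, query_embedding: np.ndarray, top_k: int = 4) -> list:
        """Return the top_k chunks whose embeddings are most similar to the query."""
        if not self.chunks:
            return []
        q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
        scores = self.embeddings @ q              # cosine similarity via dot product
        idx = np.argsort(-scores)[:top_k]
        return [self.chunks[i] for i in idx]
```

A caller would add each processed chunk with memory.add(text, embed(text)) and, at generation time, call memory.retrieve(embed(current_input)) to obtain candidate chunks for the model to attend to; embed here stands for any sentence-level encoder and is likewise an assumed helper.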

Key words: Retrieval-augmented language modeling, Long text generation, Natural language processing, Long text evaluation, Retrieval-augmented generation

Abstract: Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. To address this issue, this paper proposes MemLong, a memory-augmented method for long-text modeling, which enhances long-context language modeling by leveraging an external retriever to access historical information. MemLong integrates a non-parametric retrieval-memory module with a partially trainable large language model, and introduces a fine-grained, controllable retrieval attention mechanism that effectively utilizes semantically relevant text blocks. The non-parametric module is responsible for retrieving relevant historical information from an external knowledge base, while the LLM generates outputs by fusing this retrieved information with the current input. The proposed attention mechanism allows the model to dynamically adjust its focus on the retrieved information during generation. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. Furthermore, MemLong significantly enhances the model's capacity to process long texts. On a single NVIDIA 3090 GPU, MemLong can scale the effective context length from 4 000 to 80 000 tokens, representing a 20-fold increase. This breakthrough enables MemLong to process longer input texts, leading to a better understanding and generation of long-form content. It provides new possibilities for tackling ultra-long text tasks and opens up promising new directions for future research in long-text language modeling.
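As a rough illustration of the retrieval attention described in the abstract, the sketch below attends jointly over local keys/values and retrieved memory keys/values, with a scalar gate scaling the logits on memory positions. The function name retrieval_attention, the tensor layout, and the mem_gate parameter are assumptions made for exposition and do not reproduce MemLong's exact fine-grained control mechanism.

```python
# Illustrative sketch of attention over both local and retrieved key/value pairs,
# with a scalar gate that scales how strongly retrieved memory is attended to.
# This is a generic approximation of retrieval attention, not MemLong's code.
import torch
import torch.nn.functional as F

def retrieval_attention(q, local_k, local_v, mem_k, mem_v, mem_gate: float = 1.0):
    """
    q:        (batch, heads, q_len, d)  queries for the current tokens
    local_k/v:(batch, heads, l_len, d)  keys/values from the local context window
    mem_k/v:  (batch, heads, m_len, d)  keys/values from retrieved memory chunks
    mem_gate: scales logits on memory positions (0 = ignore memory, 1 = full weight)
    """
    d = q.size(-1)
    local_scores = q @ local_k.transpose(-2, -1) / d ** 0.5
    mem_scores = mem_gate * (q @ mem_k.transpose(-2, -1) / d ** 0.5)
    scores = torch.cat([mem_scores, local_scores], dim=-1)   # attend over both sources
    weights = F.softmax(scores, dim=-1)
    values = torch.cat([mem_v, local_v], dim=-2)              # same ordering as scores
    return weights @ values                                    # (batch, heads, q_len, d)
```

Setting mem_gate to 0 recovers plain local attention, while values between 0 and 1 down-weight the retrieved chunks, which mirrors the idea of dynamically adjusting how much attention the model pays to retrieved information during generation.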

Key words: Retrieval-augmented language modeling, Long context generation, Natural language processing, Long context evaluation, Retrieval-augmented generation

CLC number: 

  • TP311