Computer Science ›› 2025, Vol. 52 ›› Issue (12): 231-238. doi: 10.11896/jsjkx.250100094

• Artificial Intelligence •

MemLong: Memory-augmented Retrieval for Long Text Modeling

LIU Weijie, TANG Zecheng, LI Juntao   

  1. School of Computer Science and Technology, Artificial Intelligence Laboratory, Soochow University, Suzhou, Jiangsu 215031, China
  • Received: 2025-01-14  Revised: 2025-04-28  Online: 2025-12-15  Published: 2025-12-09
  • About author: LIU Weijie, born in 1999, postgraduate. His main research interests include RAG, long-context language models, NLP, and LLMs.
    LI Juntao, born in 1992, Ph.D., associate professor. His main research interest is natural language generation.

Abstract: Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. To address this issue, this paper proposes MemLong, a memory-augmented method for long-text modeling, which enhances long-context language modeling by leveraging an external retriever to access historical information. MemLong integrates a non-parametric retrieval-memory module with a partially trainable large language model, and introduces a fine-grained, controllable retrieval attention mechanism that effectively utilizes semantically relevant text blocks. The non-parametric module is responsible for retrieving relevant historical information from an external knowledge base, while the LLM generates outputs by fusing this retrieved information with the current input. The proposed attention mechanism allows the model to dynamically adjust its focus on the retrieved information during generation. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. Furthermore, MemLong significantly enhances the model's capacity to process long texts: on a single NVIDIA 3090 GPU, it can scale the effective context length from 4 000 to 80 000 tokens, a 20-fold increase. This enables MemLong to process longer input texts, leading to better understanding and generation of long-form content. It provides new possibilities for tackling ultra-long text tasks and opens up promising new directions for future research in long-text language modeling.
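The architecture described in the abstract can be made concrete with a small illustration. The following Python sketch (not the authors' code) shows one way a non-parametric chunk memory might be written to and queried during streaming generation; the chunk size, the top-k value, and the `embed` encoder are hypothetical stand-ins for the external retriever mentioned in the abstract.

```python
# Minimal, illustrative sketch of a MemLong-style external retrieval memory.
# All constants and the embed() encoder are assumptions for demonstration only.
import numpy as np

CHUNK_SIZE = 512   # tokens per memory chunk (assumed value)
TOP_K = 4          # number of retrieved chunks fused with the current window

def embed(chunk_tokens: list[int]) -> np.ndarray:
    """Hypothetical chunk encoder standing in for the external retriever:
    a deterministic random projection so the sketch runs without model weights."""
    rng = np.random.default_rng(abs(hash(tuple(chunk_tokens))) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

class RetrievalMemory:
    """Non-parametric memory: stores (embedding, chunk) pairs for past context
    and returns the chunks most similar to the current query."""
    def __init__(self):
        self.keys: list[np.ndarray] = []
        self.chunks: list[list[int]] = []

    def write(self, chunk_tokens: list[int]) -> None:
        self.keys.append(embed(chunk_tokens))
        self.chunks.append(chunk_tokens)

    def retrieve(self, query_tokens: list[int], k: int = TOP_K) -> list[list[int]]:
        if not self.keys:
            return []
        q = embed(query_tokens)
        scores = np.stack(self.keys) @ q            # cosine similarity (unit vectors)
        top = np.argsort(scores)[::-1][:k]
        return [self.chunks[i] for i in top]

# Usage: stream a long document chunk by chunk; each new chunk is processed
# together with retrieved historical chunks, so the effective context can grow
# far beyond the model's native attention window.
memory = RetrievalMemory()
long_document = list(range(10 * CHUNK_SIZE))        # stand-in token ids
for start in range(0, len(long_document), CHUNK_SIZE):
    current = long_document[start:start + CHUNK_SIZE]
    retrieved = memory.retrieve(current)            # historical evidence
    # ... an LLM would fuse `retrieved` with `current` here ...
    memory.write(current)                           # append to external memory
print(f"memory holds {len(memory.chunks)} chunks")
```

In MemLong itself the fusion of retrieved chunks is performed through the controllable retrieval attention mechanism rather than by simple concatenation, but the read/write cycle above illustrates why the effective context length can grow well beyond the model's native window.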

Key words: Retrieval-augmented language modeling, Long-context generation, Natural language processing, Long-context evaluation, Retrieval-augmented generation

CLC Number: TP311

References
[1]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[2]KOH H Y,JU J X,LIU M,et al.An empirical survey on long document summarization:Datasets,models,and metrics[J].ACM Computing Surveys,2022,55(8):1-35.
[3]WANG J,LEONG C T,WANG J S,et al.Instruct once,chat consistently in multiple rounds:An efficient tuning framework for dialogue[J].arXiv:2402.06967,2024.
[4]BELTAGY I,PETERS M E,COHAN A.Longformer:The long-document transformer[J].arXiv:2004.05150,2020.
[5]WANG S,LI B Z,KHABSA M,et al.Linformer:Self-attention with linear complexity[J].arXiv:2006.04768,2020.
[6]KITAEV N,KAISER Ł,LEVSKAYA A.Reformer:The efficient transformer[J].arXiv:2001.04451,2020.
[7]XIAO G X,TIAN Y D,CHEN B D,et al.Efficient streaming language models with attention sinks[J].arXiv:2309.17453,2023.
[8]CHEN S Y,WONG S,CHEN L J,et al.Extending context window of large language models via positional interpolation[J].arXiv:2306.15595,2023.
[9]LU Y,ZHOU X,HE W,et al.Longheads:Multi-head attention is secretly a long context processor[J].arXiv:2402.10685,2024.
[10]DAI Z H,YANG Z L,YANG Y M,et al.Transformer-XL:Attentive language models beyond a fixed-length context[J].arXiv:1901.02860,2019.
[11]BERTSCH A,ALON U,NEUBIG G,et al.Unlimiformer:Long-range transformers with unlimited length input[C]//Advances in Neural Information Processing Systems.2024.
[12]YU H F,ZHANG Y,BI W,et al.Trams:Training-free memory selection for long-range language modeling[J].arXiv:2310.15494,2023.
[13]WU Y H,RABE M N,HUTCHINS D,et al.Memorizing transformers[J].arXiv:2203.08913,2022.
[14]WANG W Z,DONG L,CHENG H,et al.Augmenting language models with long-term memory[C]//Advances in Neural Information Processing Systems.2024.
[15]RUBIN O,BERANT J.Long-range language modeling with self-retrieval[J].arXiv:2306.13421,2023.
[16]TOUVRON H,LAVRIL T,IZACARD G,et al.Llama:Open and efficient foundation language models[J].arXiv:2302.13971,2023.
[17]ZHANG R R,HAN J M,LIU C,et al.Llama-adapter:Efficient fine-tuning of language models with zero-init attention[J].arXiv:2303.16199,2023.
[18]SU J L,AHMED M,LU Y,et al.Roformer:Enhanced transformer with rotary position embedding[J].Neurocomputing,2024,568:127063.
[19]HU E J,SHEN Y L,WALLIS P,et al.Lora:Low-rank adaptation of large language models[J].arXiv:2106.09685,2021.
[20]FU Y,PANDA R,NIU X Y,et al.Data engineering for scaling language models to 128k context[J].arXiv:2402.10171,2024.
[21]TWORKOWSKI S,STANISZEWSKI K,PACEK M,et al.Focused transformer:Contrastive training for context scaling[C]//Advances in Neural Information Processing Systems.2024.
[22]RAE J W,POTAPENKO A,JAYAKUMAR S M,et al.Compressive transformers for long-range sequence modelling[J].arXiv:1911.05507,2019.
[23]ZHU Y K,KIROS R,ZEMEL R,et al.Aligning books and movies:Towards story-like visual explanations by watching movies and reading books[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:19-27.
[24]MERITY S,XIONG C M,BRADBURY J,et al.Pointer sentinel mixture models[J].arXiv:1609.07843,2016.
[25]AZERBAYEV Z,SCHOELKOPF H,PASTER K,et al.Llemma:An open language model for mathematics[J].arXiv:2310.10631,2023.
[26]YEN H,GAO T Y,CHEN D Q.Long-context language modeling with parallel context encoding[J].arXiv:2402.16617,2024.
[27]JOHNSON J,DOUZE M,JÉGOU H.Billion-scale similarity search with GPUs[J].IEEE Transactions on Big Data,2019,7(3):535-547.
[28]CHEN Y K,QIAN S J,TANG H T,et al.Longlora:Efficient fine-tuning of long-context large language models[J].arXiv:2309.12307,2023.
[29]PENG B W,QUESNELLE J,FAN H L,et al.Yarn:Efficient context window extension of large language models[J].arXiv:2309.00071,2023.
[30]ABDIN M,JACOBS S A,AWAN A A,et al.Phi-3 technical report:A highly capable language model locally on your phone[J].arXiv:2404.14219,2024.
[31]BROWN T,MANN B,RYDER N,et al.Language models are few-shot learners[J].Advances in Neural Information Processing Systems,2020,33:1877-1901.
[32]PRESS O,SMITH N A,LEWIS M.Train short,test long:Attention with linear biases enables input length extrapolation[J].arXiv:2108.12409,2021.
[33]LEWIS P,PEREZ E,PIKTUS A,et al.Retrieval-augmented generation for knowledge-intensive NLP tasks[J].Advances in Neural Information Processing Systems,2020,33:9459-9474.
[34]IZACARD G,GRAVE E.Leveraging passage retrieval with generative models for open domain question answering[J].arXiv:2007.01282,2020.
[35]RAM O,LEVINE Y,DALMEDIGOS I,et al.In-context retrieval-augmented language models[J].Transactions of the Association for Computational Linguistics,2023,11:1316-1331.
[36]YU W H,ITER D,WANG S H,et al.Generate rather than retrieve:Large language models are strong context generators[J].arXiv:2209.10063,2022.
[37]ASAI A,WU Z Q,WANG Y Z,et al.Self-rag:Learning to retrieve,generate,and critique through self-reflection[J].arXiv:2310.11511,2023.
[38]GUU K,LEE K,TUNG Z,et al.Realm:Retrieval-augmented language model pre-training[J].arXiv:2002.08909,2020.
[39]KHANDELWAL U,LEVY O,JURAFSKY D,et al.Generalization through memorization:Nearest neighbor language models[J].arXiv:1911.00172,2019.