Computer Science, 2025, Vol. 52, Issue (12): 231-238. DOI: 10.11896/jsjkx.250100094
LIU Weijie, TANG Zecheng, LI Juntao
Abstract: Large language models (LLMs) have made remarkable progress in recent years and demonstrated outstanding performance across many domains. However, handling long-text tasks remains a major challenge for LLMs, owing to the quadratic time and space complexity of the attention mechanism and the growing memory footprint of the key-value cache during generation. To address this problem, this paper proposes MemLong, a memory-augmented method for long-text modeling that uses an external retriever to fetch historical information and thereby strengthen long-context language modeling. MemLong combines a non-parametric retrieval-memory module with a partially trainable large language model, and introduces a fine-grained, controllable retrieval attention mechanism that exploits semantically related text chunks. The non-parametric retrieval-memory module retrieves historical information relevant to the current input from an external knowledge base, while the large language model fuses the retrieved information with the current input to generate the output. The fine-grained controllable retrieval attention mechanism allows the model to dynamically adjust how much attention it pays to the retrieved information during generation, enabling more precise text generation. Comprehensive evaluation on multiple long-context language modeling benchmarks shows that MemLong consistently outperforms other state-of-the-art LLMs. Moreover, MemLong substantially extends the length of text the model can handle: on a single 3090 GPU, it increases the context length from 4,000 to 80,000 tokens, a 20-fold improvement. This advance allows MemLong to process much longer inputs and to better understand and generate long-form content, opening new possibilities for ultra-long-text tasks and new directions for future research on long-text language modeling.
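To make the mechanism described in the abstract concrete, the following is a minimal, self-contained Python sketch of the general idea only, not the authors' implementation: an external chunk-level memory stores key/value states, a retriever returns the chunks most similar to the current input, and an attention step mixes local and retrieved key/value states under a controllable gate. All names (ChunkMemory, retrieval_attention, gate) and the use of a fixed scalar gate are illustrative assumptions.

```python
# Conceptual sketch of chunk-level retrieval-augmented attention (illustrative only).
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ChunkMemory:
    """Non-parametric memory: one retrieval embedding plus K/V states per chunk."""

    def __init__(self):
        self.embeddings = []   # list of (d_emb,) chunk-level retrieval embeddings
        self.kv = []           # list of (keys, values), each (chunk_len, d_model)

    def add(self, chunk_emb, keys, values):
        self.embeddings.append(chunk_emb / (np.linalg.norm(chunk_emb) + 1e-8))
        self.kv.append((keys, values))

    def retrieve(self, query_emb, top_k=2):
        """Return concatenated K/V of the top_k chunks most similar to the query."""
        if not self.embeddings:
            return None, None
        q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
        scores = np.stack(self.embeddings) @ q
        idx = np.argsort(-scores)[:top_k]
        keys = np.concatenate([self.kv[i][0] for i in idx], axis=0)
        values = np.concatenate([self.kv[i][1] for i in idx], axis=0)
        return keys, values


def retrieval_attention(q, local_k, local_v, mem_k, mem_v, gate=0.5):
    """Single-head attention whose output mixes local and retrieved context.

    `gate` in [0, 1] stands in for the controllable weight on retrieved
    information mentioned in the abstract (a fixed scalar here for simplicity).
    """
    d = q.shape[-1]
    local_ctx = softmax(q @ local_k.T / np.sqrt(d)) @ local_v
    if mem_k is None:
        return local_ctx
    mem_ctx = softmax(q @ mem_k.T / np.sqrt(d)) @ mem_v
    return (1.0 - gate) * local_ctx + gate * mem_ctx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_emb = 64, 32
    memory = ChunkMemory()
    # Index two "historical" chunks: chunk embedding plus per-token K/V states.
    for _ in range(2):
        memory.add(rng.normal(size=d_emb),
                   rng.normal(size=(16, d_model)),
                   rng.normal(size=(16, d_model)))
    # Current input: retrieve related chunks, then attend over local + retrieved K/V.
    mem_k, mem_v = memory.retrieve(rng.normal(size=d_emb), top_k=1)
    out = retrieval_attention(rng.normal(size=(8, d_model)),
                              rng.normal(size=(8, d_model)),
                              rng.normal(size=(8, d_model)),
                              mem_k, mem_v, gate=0.3)
    print(out.shape)  # (8, 64)
```

In a full system the gate would typically be learned and applied per head or per token, and the memory would be backed by an approximate nearest-neighbor index rather than exhaustive cosine similarity; the sketch keeps both simple to isolate the retrieval-plus-attention idea.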