Computer Science ›› 2024, Vol. 51 ›› Issue (7): 296-302. doi: 10.11896/jsjkx.231100201
WENG Yu1, LUO Haoyu1, Chaomurilige1, LIU Xuan1, DONG Jun1, LIU Zheng1,2
Abstract: To address the inability of existing models to handle automatic summarization for multi-ethnic low-resource languages, this paper proposes CINOSUM, an extractive summarization model for multi-ethnic low-resource languages built on CINO. To broaden the language coverage of text summarization, a summarization dataset covering multiple ethnic languages, MESUM, is first constructed. To overcome the poor performance of previous models on low-resource languages, a framework with a unified sentence extractor is built to perform extractive summarization across different ethnic languages. In addition, a joint training strategy over multilingual datasets is proposed to compensate for insufficient knowledge acquisition, extending the model's applicability to low-resource languages and markedly improving its adaptability and flexibility. Finally, extensive experiments on the MESUM dataset show that CINOSUM performs strongly in multi-ethnic low-resource settings, including Tibetan and Uyghur, and achieves significant gains under the ROUGE evaluation metrics.
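To make the extractive pipeline concrete, below is a minimal Python sketch of sentence scoring with a CINO-style multilingual encoder. It is an illustration under stated assumptions, not the paper's CINOSUM extractor: the HuggingFace checkpoint name hfl/cino-base-v2, mean-pooled sentence embeddings, and cosine similarity to the document centroid as a salience score are all assumptions introduced here, whereas CINOSUM itself uses a trained, unified sentence extractor on top of CINO.

```python
# Minimal sketch: score sentences with a CINO-style multilingual encoder
# and keep the top-k as an extractive summary.
# Assumptions (not from the paper): the checkpoint name "hfl/cino-base-v2",
# mean pooling, and centroid cosine similarity as the salience heuristic.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/cino-base-v2")
model = AutoModel.from_pretrained("hfl/cino-base-v2")
model.eval()

def embed(sentences):
    # Mean-pool the last hidden states, ignoring padding positions.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def extract_summary(sentences, k=3):
    # Rank sentences by cosine similarity to the document centroid
    # and return the top-k in their original document order.
    embs = embed(sentences)
    centroid = embs.mean(dim=0, keepdim=True)
    scores = torch.nn.functional.cosine_similarity(embs, centroid)
    top = scores.topk(min(k, len(sentences))).indices.sort().values
    return [sentences[i] for i in top]
```

Because CINO extends XLM-R with minority-language vocabulary, the same code path can serve Tibetan and Uyghur input without per-language tokenizers; the selected sentences would then be compared against reference summaries with ROUGE, as in the paper's evaluation.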
[1] MIHALCEA R, TARAU P. TextRank: Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[2] AIZAWA A. An information-theoretic perspective of tf-idf measures[J]. Information Processing & Management, 2003, 39(1): 45-65.
[3] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[4] LIU Y. Fine-tune BERT for extractive summarization[J]. arXiv:1903.10318, 2019.
[5] ZHANG H, LIU X, ZHANG J. Extractive summarization via ChatGPT for faithful summary generation[J]. arXiv:2304.04193, 2023.
[6] ZHANG H, LIU X, ZHANG J. DiffuSum: Generation enhanced extractive summarization with diffusion[J]. arXiv:2305.01735, 2023.
[7] MEDSKER L R, JAIN L C. Recurrent neural networks[J]. Design and Applications, 2001, 5(64/65/66/67): 2.
[8] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[9] YANG Z, XU Z, CUI Y, et al. CINO: A Chinese minority pre-trained language model[J]. arXiv:2202.13558, 2022.
[10] YAN X D, WANG Y Q, HUANG S, et al. Tibetan Text Summarization Dataset[J]. China Scientific Data: Online English and Chinese Edition, 2022, 7(2): 39-45.
[11] HU B, CHEN Q, ZHU F. LCSTS: A large scale Chinese short text summarization dataset[J]. arXiv:1506.05865, 2015.
[12] HOU L W, HU P, CAO W L. Research on Chinese Generative Automatic Summarization with Topic Keyword Information Fusion[J]. Acta Automatica Sinica, 2019, 45(3): 530-539.
[13] HUANG B, LIU C C. Chinese Automatic Text Summarization Based on Weighted TextRank[J]. Application Research of Computers, 2020, 37(2): 407-410.
[14] CHEN Y Z, LI B L, YU S W. Design and Implementation of Tibetan Word Segmentation System[J]. Journal of Chinese Information Processing, 2003, 17(3): 16-21.
[15] LI W. Research on Tibetan News Summary Generation Based on a Unified Model[D]. Beijing: Minzu University of China, 2021.
[16] HUANG S, YAN X, OUYANG X, et al. Abstractive Summarization of Tibetan Based on End-to-end Pre-trained Model[C]//Proceedings of the 22nd Chinese National Conference on Computational Linguistics. 2023: 113-123.
[17] LI W, YAN X D, XIE X Q. Tibetan Extractive Summary Generation Based on Improved TextRank[J]. Journal of Chinese Information Processing, 2020, 34(9): 36-43.
[18] CONNEAU A, KHANDELWAL K, GOYAL N, et al. Unsupervised Cross-lingual Representation Learning at Scale[J]. arXiv:1911.02116, 2019.
[19] ZAYTAR M A, AMRANI C E. Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks[J]. International Journal of Computer Applications, 2016, 143(11): 7-11.