计算机科学 ›› 2024, Vol. 51 ›› Issue (7): 296-302. doi: 10.11896/jsjkx.231100201

• 人工智能 •

CINOSUM:面向多民族低资源语言的抽取式摘要模型

翁彧1, 罗皓予1, 超木日力格1, 刘轩1, 董俊1, 刘征1,2   

  1 中央民族大学民族语言智能分析与安全治理教育部重点实验室 北京 100081
    2 中央民族大学中国少数民族语言文学学院 北京 100081
  • 收稿日期:2023-11-30 修回日期:2024-03-14 出版日期:2024-07-15 发布日期:2024-07-10
  • 通讯作者: 刘征 (liuzheng@muc.edu.cn)
  • 作者简介:(wengyu@muc.edu.cn)
  • 基金资助:
    国家重点研发计划 (2020YFB1406702-3);国家自然科学基金 (61772575,62006257)

CINOSUM: An Extractive Summarization Model for Low-resource Multi-ethnic Language

WENG Yu1, LUO Haoyu1, Chaomurilige1, LIU Xuan1, DONG Jun1, LIU Zheng1,2   

  1 Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance (Ministry of Education), Minzu University of China, Beijing 100081, China
    2 School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China
  • Received:2023-11-30 Revised:2024-03-14 Online:2024-07-15 Published:2024-07-10
  • About author:WENG Yu,born in 1980,Ph.D,professor,Ph.D supervisor.His main research interests include machine learning and cloud computing.
    LIU Zheng,born in 1990,Ph.D.His main research interests include NLP,data mining,and AI.
  • Supported by:
    National Key R & D Program of China(2020YFB1406702-3) and National Natural Science Foundation of China(61772575,62006257).

摘要: 针对现有的模型无法处理多民族低资源语言自动摘要生成的问题,基于CINO 提出了一种面向多民族低资源语言的抽取式摘要模型CINOSUM。为扩大文本摘要的语言范围,首先构建了多种民族语言的摘要数据集MESUM。为解决以往模型在低资源语言上效果不佳的问题,构建了一个框架,采用统一的句子抽取器,以进行不同民族语言的抽取式摘要生成。此外,提出采用多语言数据集的联合训练方法,旨在弥补知识获取上的不足,进而扩展在低资源语言上的应用,显著增强模型的适应性与灵活性。最终,在MESUM数据集上开展了广泛的实验研究,实验结果表明CINOSUM模型在包括藏语和维吾尔语在内的多民族低资源语言环境中表现卓越,并且在ROUGE评价体系下取得了显著的性能提升。
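摘要中所述的统一句子抽取器可以用一个极简的示意来理解：用 CINO 风格的多语言编码器对每个句子编码，与全文表示打分，取得分最高的 k 个句子作为抽取式摘要。下面是一个仅供说明的 Python 草图（并非论文发布的实现）：其中的模型名 hfl/cino-base-v2、均值池化以及余弦相似度打分规则均为示例性假设。

```python
# Illustrative sketch only (not the authors' released code): encode each
# sentence with a CINO-style multilingual encoder, score it against the
# whole-document representation, and keep the top-k sentences.
# The checkpoint name and the cosine-similarity scoring rule are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "hfl/cino-base-v2"  # assumed CINO checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Mean-pooled encoder representations for a list of texts."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

def extract_summary(sentences, k=3):
    """Return the k sentences most similar to the whole document."""
    sent_vecs = embed(sentences)
    doc_vec = embed([" ".join(sentences)])
    scores = torch.nn.functional.cosine_similarity(sent_vecs, doc_vec)
    top = scores.topk(min(k, len(sentences))).indices.sort().values
    return [sentences[i] for i in top.tolist()]
```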

关键词: 抽取式摘要, 多语言预训练模型, 低资源语言信息处理, 知识迁移

Abstract: To address the inability of existing models to handle automatic summarization for low-resource multi-ethnic languages, this paper proposes an extractive summarization model, CINOSUM, based on CINO (a Chinese minority pre-trained language model). We first construct a multi-ethnic language summarization dataset, MESUM, to extend the linguistic scope of text summarization. To overcome the poor performance of previous models on low-resource languages, a unified sentence extractor is employed for extractive summarization across the different ethnic languages. In addition, we introduce a joint training strategy over the multilingual datasets, which compensates for the lack of knowledge acquisition, extends the model's applicability to low-resource languages, and thereby greatly improves its adaptability and flexibility. Finally, extensive experiments are conducted on the MESUM dataset. The results show that CINOSUM achieves superior performance in multi-ethnic low-resource language settings, including Tibetan and Uyghur, with significant improvements under the ROUGE evaluation metric.
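The joint multilingual training strategy mentioned above can be pictured as a data-mixing loop in which mini-batches are drawn from several language-specific summarization corpora, so that the low-resource languages share a single sentence extractor. The sketch below is a minimal illustration under stated assumptions: the dataset layout, the temperature-based sampling, and the step count are illustrative choices, not the paper's exact recipe.

```python
# Sketch of joint multilingual training as mixed-batch sampling.
# `datasets` maps a language code to a list of (document, sentence_labels)
# pairs; smaller corpora are up-weighted by temperature sampling, a common
# way to balance low-resource languages (an assumption, not the paper's
# documented procedure).
import random

def mixed_batches(datasets, batch_size=16, temperature=0.5, steps=1000):
    """Yield (language, batch) pairs sampled from several corpora."""
    langs = list(datasets)
    sizes = [len(datasets[lang]) for lang in langs]
    weights = [size ** temperature for size in sizes]
    for _ in range(steps):
        lang = random.choices(langs, weights=weights, k=1)[0]
        pool = datasets[lang]
        yield lang, random.sample(pool, min(batch_size, len(pool)))
```

Each yielded batch would then be fed to the shared extractor, so parameter updates alternate across languages instead of training one model per language.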

Key words: Extractive summarization, Multilingual pre-trained model, Low-resource language processing, Knowledge transfer

中图分类号: TP391