Computer Science ›› 2024, Vol. 51 ›› Issue (7): 296-302. doi: 10.11896/jsjkx.231100201

• Artificial Intelligence •

CINOSUM: An Extractive Summarization Model for Low-resource Multi-ethnic Language

WENG Yu1, LUO Haoyu1, Chaomurilige1, LIU Xuan1, DONG Jun1, LIU Zheng1,2

1. Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education, Beijing 100081, China
  2. School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China
• Received: 2023-11-30  Revised: 2024-03-14  Online: 2024-07-15  Published: 2024-07-10
• About author: WENG Yu, born in 1980, Ph.D, professor, Ph.D supervisor. His main research interests include machine learning and cloud computing.
    LIU Zheng, born in 1990, Ph.D. His main research interests include NLP, data mining, and AI.
• Supported by:
    National Key R&D Program of China (2020YFB1406702-3) and National Natural Science Foundation of China (61772575, 62006257).

Abstract: To address the inability of existing models to perform abstractive summarization for low-resource multi-ethnic languages, this paper proposes CINOSUM, an extractive summarization model based on CINO (a Chinese minority pre-trained language model). We construct MESUM, a multi-ethnic language summarization dataset, to extend the linguistic scope of text summarization. To overcome the poor performance of previous models on low-resource languages, a unified sentence-extraction framework is employed for extractive summarization across the various ethnic languages. In addition, we introduce a joint training strategy over the multilingual datasets, which effectively extends the model to low-resource languages and greatly improves its adaptability and flexibility. Finally, we conduct an extensive experimental study on MESUM; the results show that CINOSUM achieves superior performance in low-resource multilingual settings, including Tibetan and Uyghur, with significant improvements in ROUGE evaluation metrics.
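To make the extractive pipeline concrete, the sketch below scores candidate sentences with a pretrained CINO encoder and keeps the highest-scoring ones in document order. It is a minimal approximation under stated assumptions, not the authors' exact architecture: the public hfl/cino-base-v2 checkpoint, the untrained linear scoring head, the max_length of 128, and the top-k selection rule are all illustrative choices, and the real model would train the head jointly on MESUM.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public CINO checkpoint (an XLM-R-style multilingual encoder);
# the paper's fine-tuned weights are not used here.
MODEL_NAME = "hfl/cino-base-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

# Hypothetical salience head; in the full model it would be trained
# jointly with the encoder on the mixed-language summarization data.
scorer = torch.nn.Linear(encoder.config.hidden_size, 1)

def extract_summary(sentences, top_k=3):
    """Score each sentence independently and return the top_k in document order."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]   # sentence embeddings at the [CLS] position
        scores = scorer(cls).squeeze(-1)                 # one salience score per sentence
    keep = sorted(scores.topk(min(top_k, len(sentences))).indices.tolist())
    return [sentences[i] for i in keep]

Because CINO provides a single multilingual encoder, the same scoring head can be trained on Tibetan, Uyghur, Mongolian, and Chinese examples mixed within the same batches, which is one way to realize the joint training strategy the abstract describes.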

Key words: Extractive summarization, Multilingual pre-trained model, Low-resource language processing, Knowledge transfer

CLC Number: TP391
[1] MIHALCEA R, TARAU P. TextRank: Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[2] AIZAWA A. An information-theoretic perspective of tf-idf measures[J]. Information Processing & Management, 2003, 39(1): 45-65.
[3] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[4] LIU Y. Fine-tune BERT for extractive summarization[J]. arXiv:1903.10318, 2019.
[5] ZHANG H, LIU X, ZHANG J. Extractive summarization via ChatGPT for faithful summary generation[J]. arXiv:2304.04193, 2023.
[6] ZHANG H, LIU X, ZHANG J. DiffuSum: Generation enhanced extractive summarization with diffusion[J]. arXiv:2305.01735, 2023.
[7] MEDSKER L R, JAIN L C. Recurrent neural networks: Design and applications[M]. CRC Press, 2001.
[8] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[9] YANG Z, XU Z, CUI Y, et al. CINO: A Chinese minority pre-trained language model[J]. arXiv:2202.13558, 2022.
[10] YAN X D, WANG Y Q, HUANG S, et al. Tibetan text summarization dataset[J]. China Scientific Data: Online English and Chinese Edition, 2022, 7(2): 39-45.
[11] HU B, CHEN Q, ZHU F. LCSTS: A large scale Chinese short text summarization dataset[J]. arXiv:1506.05865, 2015.
[12] HOU L W, HU P, CAO W L. Research on Chinese generative automatic summarization with topic keyword information fusion[J]. Acta Automatica Sinica, 2019, 45(3): 530-539.
[13] HUANG B, LIU C C. Chinese automatic text summarization based on weighted TextRank[J]. Application Research of Computers, 2020, 37(2): 407-410.
[14] CHEN Y Z, LI B L, YU S W. Design and implementation of Tibetan word segmentation system[J]. Journal of Chinese Information Processing, 2003, 17(3): 16-21.
[15] LI W. Research on Tibetan news summary generation based on a unified model[D]. Beijing: Minzu University of China, 2021.
[16] HUANG S, YAN X, OUYANG X, et al. Abstractive summarization of Tibetan based on end-to-end pre-trained model[C]//Proceedings of the 22nd Chinese National Conference on Computational Linguistics. 2023: 113-123.
[17] LI W, YAN X D, XIE X Q. Tibetan extractive summary generation based on improved TextRank[J]. Journal of Chinese Information Processing, 2020, 34(9): 36-43.
[18] CONNEAU A, KHANDELWAL K, GOYAL N, et al. Unsupervised cross-lingual representation learning at scale[J]. arXiv:1911.02116, 2019.
[19] ZAYTAR M A, AMRANI C E. Sequence to sequence weather forecasting with long short-term memory recurrent neural networks[J]. International Journal of Computer Applications, 2016, 143(11): 7-11.