Computer Science ›› 2024, Vol. 51 ›› Issue (7): 296-302. doi: 10.11896/jsjkx.231100201

• Artificial Intelligence •

CINOSUM: An Extractive Summarization Model for Low-resource Multi-ethnic Language

WENG Yu1, LUO Haoyu1, Chaomurilige1, LIU Xuan1, DONG Jun1, LIU Zheng1,2

1. Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education, Beijing 100081, China
  2. School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China
• Received: 2023-11-30  Revised: 2024-03-14  Online: 2024-07-15  Published: 2024-07-10
• About author: WENG Yu, born in 1980, Ph.D, professor, Ph.D supervisor. His main research interests include machine learning and cloud computing.
    LIU Zheng, born in 1990, Ph.D. His main research interests include NLP, data mining, and AI.
• Supported by:
    National Key R&D Program of China (2020YFB1406702-3) and National Natural Science Foundation of China (61772575, 62006257).

Abstract: To address the inability of existing models to perform abstractive summarization for low-resource multi-ethnic languages, this paper proposes CINOSUM, an extractive summarization model based on CINO (a Chinese minority pre-trained language model). We construct MESUM, a multi-ethnic language summarization dataset, to extend the linguistic scope of text summarization. To overcome the poor performance of previous models on low-resource languages, a unified sentence-extraction framework is employed for extractive summarization across the various ethnic languages. In addition, we introduce a joint training strategy over the multilingual datasets, which effectively extends the model to low-resource languages and greatly improves its adaptability and flexibility. Finally, we conduct an extensive experimental study on MESUM; the results show that CINOSUM achieves superior performance in low-resource multilingual settings, including Tibetan and Uyghur, with significant improvements in ROUGE evaluation metrics.
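To make the extractive pipeline concrete, the sketch below scores candidate sentences with a pretrained CINO encoder and keeps the highest-scoring ones in document order. It is a minimal approximation under stated assumptions, not the authors' exact architecture: the public hfl/cino-base-v2 checkpoint, the untrained linear scoring head, the max_length of 128, and the top-k selection rule are all illustrative choices, and the real model would train the head jointly on MESUM.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public CINO checkpoint (an XLM-R-style multilingual encoder);
# the paper's fine-tuned weights are not used here.
MODEL_NAME = "hfl/cino-base-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

# Hypothetical salience head; in the full model it would be trained
# jointly with the encoder on the mixed-language summarization data.
scorer = torch.nn.Linear(encoder.config.hidden_size, 1)

def extract_summary(sentences, top_k=3):
    """Score each sentence independently and return the top_k in document order."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]   # sentence embeddings at the [CLS] position
        scores = scorer(cls).squeeze(-1)                 # one salience score per sentence
    keep = sorted(scores.topk(min(top_k, len(sentences))).indices.tolist())
    return [sentences[i] for i in keep]

Because CINO provides a single multilingual encoder, the same scoring head can be trained on Tibetan, Uyghur, Mongolian, and Chinese examples mixed within the same batches, which is one way to realize the joint training strategy the abstract describes.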

Key words: Extractive summarization, Multilingual pre-trained model, Low-resource language processing, Knowledge transfer

CLC Number: TP391
[1] MIHALCEA R, TARAU P. TextRank: Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[2] AIZAWA A. An information-theoretic perspective of tf-idf measures[J]. Information Processing & Management, 2003, 39(1): 45-65.
[3] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[4] LIU Y. Fine-tune BERT for extractive summarization[J]. arXiv:1903.10318, 2019.
[5] ZHANG H, LIU X, ZHANG J. Extractive summarization via ChatGPT for faithful summary generation[J]. arXiv:2304.04193, 2023.
[6] ZHANG H, LIU X, ZHANG J. DiffuSum: Generation enhanced extractive summarization with diffusion[J]. arXiv:2305.01735, 2023.
[7] MEDSKER L R, JAIN L C. Recurrent neural networks: Design and applications[M]. CRC Press, 2001.
[8] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[9] YANG Z, XU Z, CUI Y, et al. CINO: A Chinese minority pre-trained language model[J]. arXiv:2202.13558, 2022.
[10] YAN X D, WANG Y Q, HUANG S, et al. Tibetan text summarization dataset[J]. China Scientific Data: Online English and Chinese Edition, 2022, 7(2): 39-45.
[11] HU B, CHEN Q, ZHU F. LCSTS: A large scale Chinese short text summarization dataset[J]. arXiv:1506.05865, 2015.
[12] HOU L W, HU P, CAO W L. Research on Chinese generative automatic summarization with topic keyword information fusion[J]. Acta Automatica Sinica, 2019, 45(3): 530-539.
[13] HUANG B, LIU C C. Chinese automatic text summarization based on weighted TextRank[J]. Application Research of Computers, 2020, 37(2): 407-410.
[14] CHEN Y Z, LI B L, YU S W. Design and implementation of Tibetan word segmentation system[J]. Journal of Chinese Information Processing, 2003, 17(3): 16-21.
[15] LI W. Research on Tibetan news summary generation based on a unified model[D]. Beijing: Minzu University of China, 2021.
[16] HUANG S, YAN X, OUYANG X, et al. Abstractive summarization of Tibetan based on end-to-end pre-trained model[C]//Proceedings of the 22nd Chinese National Conference on Computational Linguistics. 2023: 113-123.
[17] LI W, YAN X D, XIE X Q. Tibetan extractive summary generation based on improved TextRank[J]. Journal of Chinese Information Processing, 2020, 34(9): 36-43.
[18] CONNEAU A, KHANDELWAL K, GOYAL N, et al. Unsupervised cross-lingual representation learning at scale[J]. arXiv:1911.02116, 2019.
[19] ZAYTAR M A, AMRANI C E. Sequence to sequence weather forecasting with long short-term memory recurrent neural networks[J]. International Journal of Computer Applications, 2016, 143(11): 7-11.