Computer Science ›› 2026, Vol. 53 ›› Issue (5): 268-275. doi: 10.11896/jsjkx.250300142

• Artificial Intelligence •

Construction of Chinese-Burmese Machine Translation Corpus Based on Pivot Optimization Self-training

LAI Hua, GUO Zirui, LI Ying, YU Zhengtao

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
    Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
  • Received: 2025-03-26 Revised: 2025-05-09 Online: 2026-05-08
  • Corresponding author: LI Ying (yingli_hlt@foxmail.com)
  • About author: (405904235@qq.com)
    LAI Hua, born in 1966, master, associate professor. His main research interests include intelligent information processing and electrical engineering automation.
    LI Ying, born in 1991, Ph.D., associate professor. Her main research interests include natural language processing and grammar correction.
  • Supported by:
    National Natural Science Foundation of China (62366027, 62306129), Yunnan Province Basic Research Project (202401CF070121, 202103AA080015, 202401BC070021, 202303AP140008), and Kunming University of Science and Technology’s “Double First Class” Creation Joint Special Project (202301BE070001-027).

Abstract: In recent years, the rapid development of language models has greatly improved the performance of supervised machine translation. However, that performance depends heavily on the quality of the parallel corpus. To address the scarcity of high-quality Chinese-Burmese parallel corpora, this paper proposes a corpus construction method based on pivot optimization self-training. First, an initial machine translation model is trained on a small, high-quality Chinese-Burmese parallel corpus. This model is then used to generate a Burmese-to-Chinese pseudo-parallel corpus. At the same time, an English-Burmese parallel corpus is introduced with English as the pivot language, and its English side is translated into Chinese with existing high-quality English-Chinese translation tools to build a second Chinese-Burmese pseudo-parallel corpus. To further improve the quality of the pseudo-parallel data, a scoring mechanism based on cross-lingual representations is designed, which uses semantic similarity to select the higher-quality sentence pairs from the two pseudo-parallel corpora. Finally, the selected high-quality pairs are used to iteratively retrain the initial translation model. Experimental results show that the proposed method improves BLEU by 8.32 points on average on the Chinese-Burmese machine translation task. Detailed analysis shows that, when the initial model is weak, the pivot language optimization effectively strengthens self-training and gradually improves the quality of the pseudo-parallel corpus. In addition, a corpus of 700,000 high-quality Chinese-Burmese parallel sentence pairs is constructed to further promote the development of Chinese-Burmese machine translation.

Key words: Parallel corpus construction, Machine translation, Self-training, Pivot language, Chinese, Burmese
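
The selection step described in the abstract (scoring two candidate pseudo-parallel pairs by semantic similarity and keeping the better one) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that both pseudo-parallel corpora share the same Burmese sentences, that similarity is the cosine between sentence vectors, and that pairs below a cutoff are discarded. The names `select_pairs`, `embed`, `threshold`, and the toy vectors in `TOY_EMBEDDINGS` are all hypothetical; in practice `embed` would be a cross-lingual sentence encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_pairs(
    burmese: List[str],
    self_trained_zh: List[str],
    pivot_zh: List[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.0,
) -> List[Tuple[str, str, float]]:
    """For each Burmese sentence, keep the Chinese candidate with the higher score.

    `embed` stands in for a cross-lingual sentence encoder (an assumption);
    `threshold` is a hypothetical cutoff below which the pair is dropped.
    """
    selected = []
    for my, zh_self, zh_pivot in zip(burmese, self_trained_zh, pivot_zh):
        src_vec = embed(my)
        score_self = cosine(src_vec, embed(zh_self))
        score_pivot = cosine(src_vec, embed(zh_pivot))
        # Ties go to the self-training candidate (an arbitrary choice).
        if score_self >= score_pivot:
            best_zh, best_score = zh_self, score_self
        else:
            best_zh, best_score = zh_pivot, score_pivot
        if best_score >= threshold:
            selected.append((my, best_zh, best_score))
    return selected


# Toy vectors standing in for real sentence embeddings (illustration only).
TOY_EMBEDDINGS = {
    "my_1": [1.0, 0.0], "zh_self_1": [0.6, 0.8], "zh_pivot_1": [1.0, 0.1],
    "my_2": [0.0, 1.0], "zh_self_2": [0.1, 1.0], "zh_pivot_2": [1.0, 0.0],
    "my_3": [1.0, 1.0], "zh_self_3": [1.0, -1.0], "zh_pivot_3": [-1.0, 1.0],
}

selected = select_pairs(
    ["my_1", "my_2", "my_3"],
    ["zh_self_1", "zh_self_2", "zh_self_3"],
    ["zh_pivot_1", "zh_pivot_2", "zh_pivot_3"],
    embed=TOY_EMBEDDINGS.__getitem__,
    threshold=0.5,
)
for my, zh, score in selected:
    print(my, zh, round(score, 3))
```

In an iterative setup along the lines the abstract describes, the pairs returned by `select_pairs` would be added to the training data, the translation model retrained, and the self-training candidates regenerated before the next round of scoring.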

CLC Number:

  • TP391.2