Computer Science, 2026, Vol. 53, Issue (5): 268-275. doi: 10.11896/jsjkx.250300142
LAI Hua (赖华), GUO Zirui (郭子瑞), LI Ying (李英), YU Zhengtao (余正涛)
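Abstract: In recent years, the rapid development of language models has greatly improved the performance of supervised machine translation. However, supervised machine translation depends heavily on the quality of the parallel corpus. To address the scarcity of high-quality Chinese-Burmese parallel corpora, this paper proposes a corpus construction method based on pivot-optimized self-training. First, an initial machine translation model is trained on a small, high-quality Chinese-Burmese parallel corpus. This model is then used to generate Burmese-to-Chinese pseudo-parallel data. At the same time, an English-Burmese parallel corpus is introduced with English as the pivot language, and an existing high-quality English-Chinese translation tool translates the pivot English into Chinese, yielding a second set of Chinese-Burmese pseudo-parallel data. To further improve the quality of the pseudo-parallel data, a scoring mechanism based on cross-lingual representations is designed, which uses semantic similarity to select the higher-quality sentence pairs from the two sets. Finally, the selected high-quality pseudo-parallel data is used to iteratively retrain the initial translation model. Experimental results show that the proposed method achieves an average improvement of 8.32 BLEU on Chinese-Burmese machine translation tasks. Detailed analysis shows that when the initial model is weak, pivot-language optimization effectively strengthens self-training and progressively improves the quality of the pseudo-parallel data. In addition, 700,000 high-quality Chinese-Burmese parallel sentence pairs were constructed to further advance Chinese-Burmese machine translation.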
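The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact scoring rule, so the sketch assumes cosine similarity between sentence embeddings and assumes that, for each Burmese sentence, the candidate with the higher score is kept. The names `select_pairs`, `embed`, `min_score`, and the toy vectors are all hypothetical; in practice `embed` would be a real cross-lingual sentence encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_pairs(
    burmese: List[str],
    self_trained_zh: List[str],
    pivot_zh: List[str],
    embed: Callable[[str], Vector],
    min_score: float = 0.0,
) -> List[Tuple[str, str, float]]:
    """For each Burmese sentence, keep the Chinese candidate whose embedding
    is closer to the Burmese embedding (assumed selection rule).

    min_score is a hypothetical cutoff, not taken from the paper; pairs whose
    best score falls below it are dropped.
    """
    selected = []
    for my, zh_self, zh_pivot in zip(burmese, self_trained_zh, pivot_zh):
        my_vec = embed(my)
        score_self = cosine(my_vec, embed(zh_self))
        score_pivot = cosine(my_vec, embed(zh_pivot))
        if score_self >= score_pivot:
            best_zh, best_score = zh_self, score_self
        else:
            best_zh, best_score = zh_pivot, score_pivot
        if best_score >= min_score:
            selected.append((my, best_zh, best_score))
    return selected


# Toy stand-in for a cross-lingual encoder: a fixed lookup table of 2-D vectors.
# The keys are placeholder sentence IDs, not real text.
toy_vectors = {
    "my_1": (1.0, 0.0),
    "zh_self_1": (0.6, 0.8),
    "zh_pivot_1": (1.0, 0.1),
    "my_2": (0.0, 1.0),
    "zh_self_2": (0.1, 1.0),
    "zh_pivot_2": (1.0, 0.0),
}

selected = select_pairs(
    burmese=["my_1", "my_2"],
    self_trained_zh=["zh_self_1", "zh_self_2"],
    pivot_zh=["zh_pivot_1", "zh_pivot_2"],
    embed=toy_vectors.__getitem__,
)
```

In this toy run the pivot candidate is kept for the first sentence and the self-trained candidate for the second. In the iterative scheme the abstract outlines, the selected pairs would then be used to retrain the model before the next round of pseudo-parallel generation.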