Construction of Chinese-Burmese Machine Translation Corpus Based on Pivot Optimization Self-training

doi:10.11896/jsjkx.250300142

Abstract

In recent years, the rapid development of language models has greatly improved the performance of supervised machine translation. However, that performance depends heavily on the quality of parallel corpora. To address the lack of high-quality Chinese-Burmese parallel corpora, this paper proposes a corpus construction method based on pivot optimization self-training. First, an initial machine translation model is trained on a small-scale, high-quality Chinese-Burmese parallel corpus. This model is then used to generate a Burmese-to-Chinese pseudo-parallel corpus. At the same time, an English-Burmese parallel corpus is introduced with English as the pivot language, and the English side is translated into Chinese with existing high-quality English-Chinese translation tools to build a second Chinese-Burmese pseudo-parallel corpus. To further improve the quality of the pseudo-parallel corpus, a cross-lingual representation scoring mechanism is designed that selects the higher-quality sentence pairs from the two pseudo-parallel corpora based on semantic similarity. Finally, the initial translation model is iteratively optimized by training on the selected high-quality pseudo-parallel corpora. Experimental results show that the proposed method achieves an average improvement of 8.32 BLEU on the Chinese-Burmese machine translation task. Detailed analysis experiments show that the pivot language optimization method effectively strengthens self-training and gradually improves the quality of the pseudo-parallel corpus when the initial model is weak. In addition, this study constructs a 700 000-pair high-quality Chinese-Burmese parallel corpus to further promote the development of Chinese-Burmese machine translation.
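The selection step described above can be illustrated with a minimal sketch. The abstract does not name the encoder or the exact selection rule, so everything here is an assumption for illustration: cosine similarity as the semantic score, a hypothetical `embed` callable standing in for a cross-lingual sentence encoder, a hypothetical `threshold` cutoff, and placeholder sentence IDs with toy vectors instead of real Burmese and Chinese text.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_pairs(
    sources: List[str],
    self_trained: List[str],
    pivoted: List[str],
    embed: Callable[[str], Vector],  # hypothetical cross-lingual encoder
    threshold: float = 0.5,          # hypothetical quality cutoff
) -> List[Tuple[str, str, float]]:
    """For each source sentence, keep the candidate translation whose
    embedding is closer to the source; drop the pair if even the better
    candidate scores below `threshold`."""
    selected = []
    for src, cand_model, cand_pivot in zip(sources, self_trained, pivoted):
        src_vec = embed(src)
        scored = [
            (cosine(src_vec, embed(cand_model)), cand_model),
            (cosine(src_vec, embed(cand_pivot)), cand_pivot),
        ]
        score, best = max(scored, key=lambda item: item[0])
        if score >= threshold:
            selected.append((src, best, score))
    return selected


# Toy embedding table: placeholder IDs, not real sentences or real vectors.
TOY_VECTORS = {
    "my_1": [1.0, 0.0], "zh_1_model": [0.9, 0.1], "zh_1_pivot": [0.2, 0.8],
    "my_2": [0.0, 1.0], "zh_2_model": [1.0, 0.0], "zh_2_pivot": [0.1, 0.9],
    "my_3": [1.0, 0.0], "zh_3_model": [0.0, 1.0], "zh_3_pivot": [0.1, 1.0],
}


def toy_embed(sentence: str) -> Vector:
    return TOY_VECTORS[sentence]


if __name__ == "__main__":
    kept = select_pairs(
        ["my_1", "my_2", "my_3"],
        ["zh_1_model", "zh_2_model", "zh_3_model"],
        ["zh_1_pivot", "zh_2_pivot", "zh_3_pivot"],
        toy_embed,
    )
    for src, tgt, score in kept:
        print(f"{src} -> {tgt} (score={score:.3f})")
```

In this toy run the first source keeps the self-training candidate, the second keeps the pivot candidate, and the third is discarded because neither candidate clears the cutoff. In an iterative setup, the kept pairs would be added to the training data before the next round of pseudo-parallel generation.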

Key words: Parallel corpus construction, Machine translation, Self-training, Pivot language, Chinese, Burmese

CLC Number: TP391.2

LAI Hua, GUO Zirui, LI Ying, YU Zhengtao. Construction of Chinese-Burmese Machine Translation Corpus Based on Pivot Optimization Self-training[J]. Computer Science, 2026, 53(5): 268-275.
