Computer Science ›› 2026, Vol. 53 ›› Issue (1): 224-230. doi: 10.11896/jsjkx.241200147

• Artificial Intelligence •

Data Augmentation Methods for Tibetan-Chinese Machine Translation Based on Long-tail Words

KALZANG Gyatso, NYIMA Tashi, QUN Nuo, GAMA Tashi, DORJE Tashi, LOBSANG Yeshi, LHAMO Kyi, ZOM Kyi

  1. School of Information Science and Technology, Tibet University, Lhasa 850000, China;
    Tibetan Language Information Technology Ministry of Education Engineering Research Center of Tibet University, Lhasa 850000, China
  • Received: 2024-12-18 Revised: 2025-03-07 Online: 2026-01-08
  • Corresponding author: NYIMA Tashi (nmzx@utibet.edu.com)
  • About author: KALZANG Gyatso (gyatso736@outlook.com), born in 1988, Ph.D. candidate, is a member of CCF (No. Z0116G). His main research interests include computational linguistics and Tibetan-Chinese machine translation.
    NYIMA Tashi, born in 1964, Ph.D., professor, Ph.D. supervisor. His main research interests include Tibetan information technology and computational linguistics.
  • Supported by:
    National Key R&D Program of the New Generation of Artificial Intelligence (2022ZD0116101), Key Project of the National Natural Science Foundation of China (62436006), Key Project of Xizang Natural Science Foundation (XZ202401ZR0040), General Project of Xizang Natural Science Foundation (XZ202401ZR0031) and Youth Fund of National Natural Science Foundation of China (62406257).

Abstract: Existing Tibetan-Chinese machine translation corpora exhibit a significant imbalance in domain data distribution, so trained models translate unevenly across domains. Back-translation, a common data augmentation method, improves model performance by supplying more diverse pseudo-parallel data. However, conventional back-translation does not adequately account for domain imbalance, so translation quality in resource-scarce domains improves little even as overall performance rises. To address this, the paper analyzes the distribution of long-tail words in the existing Tibetan-Chinese bilingual corpora, uses those long-tail words to select monolingual data in a targeted way, and constructs pseudo-data from the selected data through back-translation for data augmentation. The strategy aims to improve the overall performance of Tibetan-Chinese machine translation models while also improving translation in data-scarce domains. Experimental results show that accounting for domain data imbalance and combining it with long-tail word data augmentation effectively improves translation performance in resource-scarce domains, offering a targeted strategy for the domain data imbalance problem.

Key words: Long-tail words, Data augmentation, Tibetan-Chinese machine translation, Domain data imbalance
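
The selection step outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the frequency threshold `max_count`, the whitespace tokenization, the toy English sentences, and all function names (`long_tail_words`, `select_monolingual`, `build_pseudo_pairs`) are assumptions made for illustration, and the back-translation model is represented only by a caller-supplied `translate_fn`.

```python
from collections import Counter


def long_tail_words(sentences, max_count=1):
    """Return words whose corpus frequency is at most max_count (assumed threshold)."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return {word for word, n in counts.items() if n <= max_count}


def select_monolingual(mono_sentences, tail_words, min_hits=1):
    """Keep monolingual sentences that contain at least min_hits distinct long-tail words."""
    selected = []
    for sent in mono_sentences:
        hits = sum(1 for word in set(sent.split()) if word in tail_words)
        if hits >= min_hits:
            selected.append(sent)
    return selected


def build_pseudo_pairs(selected, translate_fn):
    """Pair each selected sentence with its back-translation (translate_fn is hypothetical)."""
    return [(translate_fn(sent), sent) for sent in selected]


# Toy whitespace-tokenized data; real Tibetan or Chinese text would need a word segmenter.
parallel_side = [
    "the court issued a ruling",
    "the weather is good today",
    "the weather is good now",
    "the patient needs surgery",
]
monolingual = [
    "the surgery went well",
    "the weather is good",
    "a new ruling was published",
]

tail = long_tail_words(parallel_side, max_count=1)
chosen = select_monolingual(monolingual, tail)
# str.upper stands in for a reverse-direction translation model.
pairs = build_pseudo_pairs(chosen, str.upper)
```

In this toy run the sentence about the weather is skipped because all of its words are frequent in the parallel data, while the sentences that mention rare words ("surgery", "ruling") are kept for back-translation.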

CLC Number:

  • G35