Computer Science ›› 2026, Vol. 53 ›› Issue (1): 224-230.doi: 10.11896/jsjkx.241200147

• Artificial Intelligence •

Data Augmentation Methods for Tibetan-Chinese Machine Translation Based on Long-tail Words

KALZANG Gyatso, NYIMA Tashi, QUN Nuo, GAMA Tashi, DORJE Tashi, LOBSANG Yeshi, LHAMO Kyi, ZOM Kyi   

  1. School of Information Science and Technology, Tibet University, Lhasa 850000, China;
  2. Tibetan Language Information Technology Ministry of Education Engineering Research Center of Tibet University, Lhasa 850000, China
  • Received: 2024-12-18  Revised: 2025-03-07  Published: 2026-01-08
  • About author: KALZANG Gyatso, born in 1988, Ph.D. candidate, is a member of CCF (No. Z0116G). His main research interests include computational linguistics and Tibetan-Chinese machine translation.
    NYIMA Tashi, born in 1964, Ph.D., professor, Ph.D. supervisor. His main research interests include Tibetan information technology and computational linguistics.
  • Supported by:
    National Key R & D Program of the New Generation of Artificial Intelligence (2022ZD0116101), Key Project of the National Natural Science Foundation of China (62436006), Key Project of the Xizang Natural Science Foundation (XZ202401ZR0040), General Project of the Xizang Natural Science Foundation (XZ202401ZR0031) and Youth Fund of the National Natural Science Foundation of China (62406257).

Abstract: Existing Tibetan-Chinese machine translation corpora exhibit significant domain data imbalance, so trained models translate inconsistently across domains. Back-translation, a common data augmentation method, improves model performance by generating diverse pseudo-parallel data. However, traditional back-translation approaches fail to account for the domain imbalance in the data distribution, so translation performance in resource-scarce domains improves little even as overall performance rises. This paper proposes a strategy that analyzes the distribution of long-tail words in the existing Tibetan-Chinese bilingual corpora, uses those long-tail words to select monolingual data in a targeted way, and generates pseudo-parallel data from the selected sentences through back-translation for data augmentation. The strategy aims to improve the overall performance of Tibetan-Chinese machine translation models while also raising translation quality in data-scarce domains. Experimental results demonstrate that fully considering domain data imbalance and applying long-tail word data augmentation effectively improves the translation performance of machine translation models in resource-scarce domains, providing a targeted approach to the problem of domain data imbalance.
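The core of the strategy described in the abstract can be sketched in two steps: identify the low-frequency (long-tail) vocabulary of the existing bilingual corpus, then pick out monolingual sentences containing those words as candidates for back-translation. The sketch below is a minimal illustration, not the paper's implementation; whitespace tokenization, the `tail_fraction` cutoff, and all function names are assumptions for demonstration.

```python
from collections import Counter

def find_long_tail_words(corpus_sentences, tail_fraction=0.2):
    """Return the lowest-frequency words, i.e. the tail of the
    frequency distribution of the bilingual training corpus."""
    counts = Counter(w for s in corpus_sentences for w in s.split())
    ranked = counts.most_common()            # sorted by descending frequency
    cut = int(len(ranked) * (1 - tail_fraction))
    return {w for w, _ in ranked[cut:]}      # keep the rarest tail_fraction

def select_monolingual(monolingual_sentences, long_tail_words):
    """Keep monolingual sentences containing at least one long-tail word;
    these become the input to back-translation for pseudo-parallel data."""
    return [s for s in monolingual_sentences
            if any(w in long_tail_words for w in s.split())]

# Toy example: "agronomy"-style terms are rare in the bitext.
bitext = ["the model translates news text",
          "the model translates news text well",
          "rare agronomy terms appear here"]
tail = find_long_tail_words(bitext, tail_fraction=0.5)
mono = ["agronomy vocabulary is scarce", "news text is common"]
selected = select_monolingual(mono, tail)
print(selected)  # only the sentence with a long-tail word survives
```

The selected sentences would then be translated by a reverse-direction model to form pseudo-parallel pairs, steering augmentation toward the under-represented domains rather than the already-dominant ones.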

Key words: Long-tail words, Data augmentation, Tibetan-Chinese machine translation, Domain data imbalance

CLC Number: G35