Computer Science (计算机科学) ›› 2026, Vol. 53 ›› Issue (1): 224-230. doi: 10.11896/jsjkx.241200147
格桑加措, 尼玛扎西, 群诺, 嘎玛扎西, 道吉扎西, 罗桑益西, 拉毛吉, 钱木吉
KALZANG Gyatso, NYIMA Tashi, QUN Nuo, GAMA Tashi, DORJE Tashi, LOBSANG Yeshi, LHAMO Kyi, ZOM Kyi
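Abstract: Existing Tibetan-Chinese machine translation corpora have an imbalanced distribution of data across domains, so models trained on them translate unevenly from one domain to another. Back-translation, a common data augmentation method, improves model performance by supplying more diverse pseudo-parallel data. Conventional back-translation, however, does not adequately account for the imbalance in domain distribution, so gains in overall performance do not carry over to resource-scarce domains. To address this, this work analyzes the distribution of long-tail words in the corpus, uses the long-tail words of the existing Tibetan-Chinese bilingual corpus to select monolingual data, and constructs pseudo-parallel data from the selected sentences through back-translation for data augmentation. The strategy aims to improve the overall performance of the Tibetan-Chinese machine translation model while also improving translation in data-scarce domains. Experimental results show that accounting for domain data imbalance, combined with long-tail-word data augmentation, effectively improves translation performance in scarce domains and offers a targeted strategy for the domain data imbalance problem.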
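The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frequency threshold `max_count`, the per-word cap `per_word_cap`, and whitespace tokenization are all assumptions made for illustration (real Tibetan and Chinese text would need a proper segmenter), and `back_translate` stands in for whatever reverse-direction translation model is available.

```python
from collections import Counter

def find_long_tail_words(parallel_sentences, max_count=1):
    """Return words that occur at most `max_count` times in the bilingual corpus side.

    Assumption: whitespace tokenization and a fixed count threshold.
    """
    counts = Counter(tok for sent in parallel_sentences for tok in sent.split())
    return {word for word, c in counts.items() if c <= max_count}

def select_monolingual(monolingual_sentences, long_tail, per_word_cap=2):
    """Keep monolingual sentences that contain a long-tail word still under its cap.

    Assumption: the cap limits how many selected sentences may count
    toward any single long-tail word, so no one word dominates.
    """
    picked = []
    used = Counter()
    for sent in monolingual_sentences:
        hits = [t for t in set(sent.split())
                if t in long_tail and used[t] < per_word_cap]
        if hits:
            picked.append(sent)
            for t in hits:
                used[t] += 1
    return picked

def build_pseudo_pairs(selected, back_translate):
    """Pair each selected sentence with its back-translation (synthetic side first)."""
    return [(back_translate(s), s) for s in selected]

# Toy English data, purely illustrative.
parallel_side = ["the cat sat", "the cat ran", "the yak grazed"]
monolingual = ["a yak grazed here", "the cat is here",
               "the yak sat down", "yak yak yak"]

long_tail = find_long_tail_words(parallel_side, max_count=1)
selected = select_monolingual(monolingual, long_tail, per_word_cap=2)
pairs = build_pseudo_pairs(selected, lambda s: "<bt> " + s)  # placeholder model
```

The resulting pseudo-parallel pairs would then be mixed with the original bilingual corpus for training. How the threshold and cap are set, and which side of the corpus supplies the long-tail words, are design choices the abstract does not specify.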