Computer Science, 2022, Vol. 49, Issue (1): 41-46. doi: 10.11896/jsjkx.210900012

• Frontier Techniques of Multilingual Computing •

  • Corresponding author: XIONG De-yi (dyxiong@tju.edu.cn)
  • First author: LIU Yan (yan_liu@tju.edu.cn)

Construction Method of Parallel Corpus for Minority Language Machine Translation

LIU Yan, XIONG De-yi

  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  • Received: 2021-09-01  Revised: 2021-10-18  Online: 2022-01-15  Published: 2022-01-18
  • About authors: LIU Yan, born in 1992, postgraduate. Her main research interests include neural machine translation and discourse parsing.
    XIONG De-yi, born in 1979, Ph.D., professor, Ph.D. supervisor, is a member of the China Computer Federation. His main research interests include natural language processing, especially machine translation, dialogue, and natural language generation.
  • Supported by:
    National Key Research and Development Program of China (2019QY1802).


Abstract: The training performance of neural machine translation models depends heavily on the scale and quality of parallel corpora. Unlike the situation for some common languages, the construction of high-quality parallel corpora between Chinese and minority languages has long lagged behind. Existing minority-language parallel corpora are mostly built from web resources with automatic sentence-alignment technology, which imposes many limitations on their text quality and domain coverage. High-quality parallel corpora can be constructed by manual translation, but relevant experience and methods are lacking. From the perspective of machine translation practitioners and researchers, this article introduces a cost-effective effort to manually construct parallel corpora between minority languages and Chinese, including its overall goals, implementation process, engineering details, and final results. Various approaches were tried and experience was accumulated during construction, yielding a summary of methods and suggestions for building minority-language-to-Chinese parallel corpora. In the end, 0.5 million high-quality parallel sentence pairs were successfully constructed for each of Persian-Chinese, Hindi-Chinese, and Indonesian-Chinese. Experimental results show that the constructed corpora are of good quality and improve the training performance of minority-language neural machine translation models.

Key words: Minority language, Neural machine translation, Parallel corpus
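As an illustration of the automatic alignment-quality filtering that the abstract contrasts with manual construction, the following is a minimal sketch (not from the paper) of a length-ratio heuristic commonly used to discard misaligned candidate sentence pairs when mining web corpora. The thresholds and the words-to-characters comparison are assumptions chosen for illustration only:

```python
def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 4.0) -> bool:
    """Keep a candidate pair only if the ratio of target-side Chinese
    characters to source-side words falls inside an assumed plausible band."""
    src_words = src.split()
    if not src_words or not tgt.strip():
        return False  # drop empty or whitespace-only sides outright
    ratio = len(tgt) / len(src_words)
    return low <= ratio <= high

# Toy Indonesian-Chinese candidates: the second pair is clearly misaligned,
# since a one-word source cannot plausibly yield a long Chinese sentence.
pairs = [
    ("ini adalah contoh kalimat", "这是一个例句"),
    ("halaman", "这是一段与源句毫无关系而且长得多的文本内容"),
]
kept = [p for p in pairs if length_ratio_ok(*p)]
print(len(kept))  # → 1
```

In practice such heuristics are only a first pass; mined pairs that survive them can still be noisy, which is one motivation the paper gives for manual construction.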

CLC Number: TP391