Computer Science ›› 2022, Vol. 49 ›› Issue (1): 41-46. doi: 10.11896/jsjkx.210900012
LIU Yan (刘妍), XIONG De-yi (熊德意)
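Abstract: The training performance of neural machine translation models depends heavily on the size and quality of the parallel corpus. Apart from a few widely used languages, the construction of high-quality parallel corpora between Chinese and low-resource languages has lagged behind. Most existing parallel corpora for low-resource languages are built from web resources using automatic sentence alignment, and they have many limitations in text quality, domain coverage, and other respects. Manual translation can produce high-quality parallel corpora, but relevant experience and methods are lacking. From the perspective of machine translation practitioners and researchers, this paper describes a cost-effective effort to manually construct parallel corpora for low-resource languages, including its overall goals, implementation process, workflow details, and final results. Various approaches were tried during construction, and the accumulated experience is summarized as methods and recommendations for building parallel corpora from low-resource languages to Chinese. In the end, 500,000 high-quality parallel sentence pairs were constructed for each of three directions: Persian to Chinese, Hindi to Chinese, and Indonesian to Chinese. Experimental results show that the constructed parallel corpora are of good quality and improve the training performance of neural machine translation models for low-resource languages.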