Computer Science ›› 2022, Vol. 49 ›› Issue (1): 41-46. doi: 10.11896/jsjkx.210900012

• Multilingual Computing Advanced Technology •

Construction Method of Parallel Corpus for Minority Language Machine Translation

LIU Yan, XIONG De-yi   

  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  • Received: 2021-09-01 Revised: 2021-10-18 Online: 2022-01-15 Published: 2022-01-18
  • About author: LIU Yan, born in 1992, postgraduate. Her main research interests include neural machine translation and discourse parsing.
    XIONG De-yi, born in 1979, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include natural language processing, especially machine translation, dialogue, and natural language generation.
  • Supported by:
    National Key Research and Development Program (2019QY1802).

Abstract: The performance of neural machine translation depends heavily on the scale and quality of the parallel corpus used for training. Unlike for some widely used languages, the construction of high-quality parallel corpora between Chinese and minority languages has lagged behind. Existing minority language parallel corpora are mostly built from web resources with automatic sentence alignment, which limits their domain coverage and quality. High-quality parallel corpora can be constructed manually, but relevant experience and methods are lacking. From the perspective of machine translation practitioners and researchers, this article introduces a cost-effective method for manually constructing parallel corpora between minority languages and Chinese, covering the overall goals, implementation process, engineering details, and final results. Drawing on the experience accumulated during construction, it summarizes methods and suggestions for building parallel corpora from minority languages to Chinese. In the end, this paper constructs 0.5 million high-quality parallel corpus entries from Persian to Chinese, Hindi to Chinese, and Indonesian to Chinese. Experimental results support the quality of the constructed corpora and show that they improve the performance of minority language neural machine translation models.

Key words: Minority language, Neural machine translation, Parallel corpus

CLC Number: 

  • TP391