Computer Science ›› 2024, Vol. 51 ›› Issue (11): 248-254. doi: 10.11896/jsjkx.231000096
LIU Xiaofeng, ZHENG Yucheng, LI Dongyang
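Abstract: Extracting parallel corpora from the Web is important for machine translation and other multilingual language processing tasks. This paper proposes a method for flexibly and efficiently extracting parallel corpora from the Web in an incremental manner. The method continuously downloads, scans, and analyzes the Web crawl archives of Common Crawl, and incrementally updates statistics on the length of text in each language under each domain name. For any given target language pair of interest, the method uses these per-domain language text length statistics to determine the entry points of the websites to crawl, and then crawls in a targeted way according to the target languages, ignoring links outside the multilingual domains and outside the target languages. In addition, a new sentence alignment method is proposed that performs global alignment within a multilingual domain based on semantic similarity. Experiments show that incremental extraction continuously yields new parallel corpora, and that extracting by a specified language pair flexibly yields parallel corpora for the target language pair of interest. The new alignment method is markedly more efficient than the global method and can complete alignments that the local method cannot. Across 6 language directions, the quality of the extracted parallel corpora is better than that of existing open-source Web parallel corpora in the 4 medium- and low-resource directions, and close to that of the best existing open-source Web parallel corpora in the 2 high-resource directions.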
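The incremental statistics step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout `(domain, lang, text_length)`, the function names `update_stats` and `candidate_domains`, and the thresholds `min_chars` and `min_ratio` are all assumptions made for illustration. The idea shown is only that per-domain, per-language text lengths can be accumulated batch by batch, and that domains with enough text in both languages of a pair are candidate crawl entry points.

```python
from collections import defaultdict


def update_stats(stats, records):
    """Add one batch of (domain, lang, text_length) records to the running totals.

    Hypothetical record layout: in practice the records would come from
    language identification over a newly downloaded crawl archive.
    """
    for domain, lang, length in records:
        stats[domain][lang] += length
    return stats


def candidate_domains(stats, src, tgt, min_chars=1000, min_ratio=0.5):
    """Return domains holding enough text in both languages, largest first.

    min_chars and min_ratio are illustrative thresholds, not values from the paper.
    """
    picked = []
    for domain, by_lang in stats.items():
        a, b = by_lang.get(src, 0), by_lang.get(tgt, 0)
        if min(a, b) < min_chars:
            continue  # too little text in one of the two languages
        if min(a, b) / max(a, b, 1) < min_ratio:
            continue  # the two languages are too unbalanced
        picked.append((domain, min(a, b)))
    picked.sort(key=lambda item: item[1], reverse=True)
    return [domain for domain, _ in picked]


# Two batches stand in for two successive crawl archives.
stats = defaultdict(lambda: defaultdict(int))
update_stats(stats, [("a.example", "en", 5000), ("a.example", "zh", 4000),
                     ("b.example", "en", 9000)])
update_stats(stats, [("b.example", "zh", 200), ("a.example", "zh", 1000)])
entries = candidate_domains(stats, "en", "zh")
```

Because the totals are only ever added to, a new archive can be folded in without rescanning earlier ones, which is the property the abstract calls incremental updating.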