Computer Science, 2024, Vol. 51, Issue (11): 248-254. doi: 10.11896/jsjkx.231000096

• Artificial Intelligence •

A Flexible and Efficient Method for Incremental Extraction of Parallel Corpora from the Web

刘小峰, 郑禹铖, 李东阳   

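  1. School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074
  • Received: 2023-10-16 Revised: 2024-03-13 Publication date: 2024-11-15 Release date: 2024-11-06
  • Corresponding author: LIU Xiaofeng (liuxf@hust.edu.cn)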

Incrementally and Flexibly Extracting Parallel Corpus from Web

LIU Xiaofeng, ZHENG Yucheng, LI Dongyang   

  1. School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
  • Received: 2023-10-16 Revised: 2024-03-13 Online: 2024-11-15 Published: 2024-11-06
  • About author: LIU Xiaofeng, born in 1974, Ph.D., associate professor, graduate supervisor. His main research interests include natural language processing based on deep learning.

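Abstract: Extracting parallel corpora from the Web is important for machine translation and other multilingual language processing tasks. This paper therefore proposes a method for flexibly and efficiently extracting parallel corpora from the Web in an incremental manner. By continuously downloading, scanning, analyzing, and tallying Common Crawl's web crawl archives, the method incrementally updates per-domain statistics on the length of text in each language. For any given target language pair of interest, the method uses these statistics to determine the entry points of the websites to crawl and then performs targeted crawling according to the target languages, ignoring links that fall outside the multilingual domain or the target languages. In addition, a new sentence alignment method is proposed that performs global alignment within a multilingual domain based on semantic similarity. Experiments show that incremental extraction continuously yields new parallel corpora, and that extracting by specified language pair flexibly yields parallel corpora for the target language pair of interest; that the new alignment method is markedly more efficient than the global method and can complete alignments that local methods cannot; and that, across six language directions, the extracted parallel corpora are of higher quality than existing open-source Web parallel corpora in four medium- and low-resource directions and of quality close to the best existing open-source Web parallel corpus in two high-resource directions.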

Keywords: parallel corpus extraction, sentence alignment, corpus construction, machine translation, Web mining
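
The incremental statistics and site selection described in the abstract can be illustrated with a minimal sketch. It assumes language identification has already been run on each page, so the input is a stream of (domain, language, text) records; the class name `DomainLanguageStats`, the thresholds `min_chars` and `min_ratio`, and the ranking rule are hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import defaultdict


class DomainLanguageStats:
    """Running totals of text length per (domain, language)."""

    def __init__(self):
        # lengths[domain][language] -> total characters seen so far
        self.lengths = defaultdict(lambda: defaultdict(int))

    def update(self, records):
        """Fold one batch of (domain, language, text) records into the totals.

        Calling this once per newly scanned archive keeps the statistics
        incremental: earlier archives never need to be rescanned.
        """
        for domain, lang, text in records:
            self.lengths[domain][lang] += len(text)

    def candidate_domains(self, src, tgt, min_chars=1000, min_ratio=0.2):
        """Return domains with enough text in both languages, largest first.

        min_chars and min_ratio are illustrative thresholds (assumptions).
        """
        scored = []
        for domain, by_lang in self.lengths.items():
            a, b = by_lang.get(src, 0), by_lang.get(tgt, 0)
            if min(a, b) < min_chars or max(a, b) == 0:
                continue  # too little text in one of the two languages
            if min(a, b) / max(a, b) < min_ratio:
                continue  # heavily unbalanced, unlikely to be parallel
            scored.append((min(a, b), domain))
        return [domain for _, domain in sorted(scored, reverse=True)]


stats = DomainLanguageStats()
# First archive batch.
stats.update([
    ("a.example", "en", "x" * 2000),
    ("a.example", "is", "y" * 1500),
    ("b.example", "en", "x" * 5000),
])
# A later batch only adds to the existing totals.
stats.update([
    ("b.example", "is", "y" * 100),
    ("c.example", "en", "x" * 3000),
    ("c.example", "is", "y" * 3000),
])
first_pass = stats.candidate_domains("en", "is")
```

Because the totals are additive, a domain that is rejected after one archive (here `b.example`) can become a crawl candidate once later archives contribute more text in the missing language.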

Abstract: Extracting parallel corpora from the web is important for machine translation and other multilingual processing tasks. This paper proposes an incremental web parallel corpus extraction method, which incrementally updates per-domain language text length statistics by continuously downloading, scanning, and analyzing Common Crawl's web crawl archives. For any given language pair of interest, the websites to crawl are determined from these per-domain statistics and crawled according to the target language pair, and non-target domains and links are discarded. The paper also proposes a new intermediate sentence alignment method, which globally aligns sentences within multilingual domains based on semantic similarity. Experiments show that: 1) the extraction method can continuously obtain new parallel corpora and, by extracting for a specified language pair, flexibly obtain corpora for the target language pair of interest; 2) the proposed intermediate method is significantly more efficient than the global method at alignment and can complete alignments that local methods cannot; 3) across 6 language directions, the extracted parallel corpora are superior to existing open-source web parallel corpora in 4 medium- and low-resource language directions and close to the best available open-source web parallel corpus in 2 high-resource language directions.

Key words: Parallel corpus extraction, Sentence alignment, Corpus construction, Machine translation, Web mining
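
To make the idea of aligning sentences across a whole domain by semantic similarity concrete, the following sketch pools all sentences of one domain and keeps mutual nearest neighbours under cosine similarity. This is a common stand-in technique, not necessarily the scoring used in the paper, which the abstract does not specify. The function names, the `threshold` value, and the toy two-dimensional vectors are assumptions; in practice the embeddings would come from a multilingual sentence encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_within_domain(src, tgt, threshold=0.8):
    """Pair sentences pooled over one domain by mutual best cosine match.

    src and tgt are lists of (sentence, embedding). The search space is the
    whole domain rather than a single document pair, and rather than the
    entire web. threshold is an illustrative cut-off (assumption).
    """
    if not src or not tgt:
        return []
    sims = [[cosine(u, v) for _, v in tgt] for _, u in src]
    # Best target index for each source sentence, and vice versa.
    best_tgt = [max(range(len(tgt)), key=row.__getitem__) for row in sims]
    best_src = [max(range(len(src)), key=lambda i: sims[i][j])
                for j in range(len(tgt))]
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep a pair only if each side is the other's best match.
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((src[i][0], tgt[j][0], sims[i][j]))
    return pairs


# Toy embeddings standing in for real multilingual sentence vectors.
src = [("Hello", [1.0, 0.0]), ("Thanks", [0.0, 1.0]), ("Unmatched", [-1.0, -1.0])]
tgt = [("Danke", [0.1, 0.9]), ("Hallo", [0.9, 0.1])]
pairs = align_within_domain(src, tgt)
```

The brute-force similarity matrix is quadratic in the number of sentences, which is tolerable within a single domain; restricting the search to one domain is what keeps this cheaper than a web-wide global search.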

CLC number:

  • TP391