Computer Science ›› 2024, Vol. 51 ›› Issue (11): 248-254.doi: 10.11896/jsjkx.231000096

• Artificial Intelligence •

Incrementally and Flexibly Extracting Parallel Corpus from Web

LIU Xiaofeng, ZHENG Yucheng, LI Dongyang   

  1. School of Software Engineering,Huazhong University of Science and Technology,Wuhan 430074,China
  • Received:2023-10-16 Revised:2024-03-13 Online:2024-11-15 Published:2024-11-06
  • About author:LIU Xiaofeng,born in 1974,Ph.D,associate professor,graduate supervisor.His main research interests include natural language processing based on deep learning.

Abstract: Extracting parallel corpora from the web is important for machine translation and other multilingual processing tasks. This paper proposes an incremental web parallel corpus extraction method, which incrementally updates per-domain language text length statistics by continuously downloading, scanning and analyzing Common Crawl's web crawling archives. For any language pair of interest, the web sites to be crawled are selected from these per-domain statistics and crawled according to the target language pair, while non-target domains and links are discarded. The paper also proposes a new intermediate sentence alignment method, which aligns sentences globally by semantic similarity within each multilingual domain. Experiments show that: 1) the extraction method can continuously obtain new parallel corpora and flexibly target any language pair of interest by extracting only the specified pairs; 2) the proposed intermediate method is significantly more efficient than the global alignment method and can complete alignments that local methods cannot; 3) of the 6 language directions evaluated, the extracted parallel corpora are superior to existing open-source web parallel corpora for the 4 medium- and low-resource languages, and close to the best available open-source web parallel corpora for the 2 high-resource languages.
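The sentence alignment idea in the abstract — scoring all cross-lingual sentence pairs within one multilingual domain by semantic similarity, then keeping the best one-to-one matches — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding step is stubbed out (in practice a multilingual sentence encoder such as LaBSE [16] would produce the vectors), and all function names and the threshold value are illustrative assumptions.

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Greedy one-to-one alignment within a domain: score every
    source/target embedding pair, then accept pairs from highest
    similarity down, skipping sentences already matched or pairs
    scoring below the threshold. Returns (src_idx, tgt_idx, score)."""
    scored = [(cosine(u, v), i, j)
              for i, u in enumerate(src_vecs)
              for j, v in enumerate(tgt_vecs)]
    scored.sort(reverse=True)
    used_src, used_tgt, pairs = set(), set(), []
    for sim, i, j in scored:
        if sim < threshold:
            break  # remaining pairs are even weaker
        if i in used_src or j in used_tgt:
            continue  # enforce one-to-one matching
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, sim))
    return pairs
```

A real system would replace the brute-force pair scoring with an approximate nearest-neighbor index (e.g. FAISS [11]) to stay tractable at web scale; the greedy acceptance loop above only conveys the global, domain-level matching idea.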

Key words: Parallel corpus extraction, Sentence alignment, Corpus construction, Machine translation, Web mining

CLC Number: TP391
[1] SCHWENK H,WENZEK G,EDUNOV S,et al.CCMatrix:Mining billions of high-quality parallel sentences on the Web[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.2021:6490-6500.
[2] EL-KISHKY A,CHAUDHARY V,GUZMAN F,et al.CCAligned:A massive collection of cross-lingual web-document pairs[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.2020:5960-5969.
[3] LISON P,TIEDEMANN J.OpenSubtitles2016:Extracting large parallel corpora from movie and TV subtitles[C]//Proceedings of the Tenth International Conference on Language Resources and Evaluation.2016:923-929.
[4] ZIEMSKI M,JUNCZYS-DOWMUNT M,POULIQUEN B.The United Nations parallel corpus v1.0[C]//Proceedings of the Tenth International Conference on Language Resources and Evaluation.2016:3530-3534.
[5] KOEHN P.Europarl:A parallel corpus for statistical machine translation[C]//Proceedings of Machine Translation Summit.2005:79-86.
[6] MORISHITA M,CHOUSA K,SUZUKI J,et al.JParaCrawl v3.0:A Large-scale English-Japanese Parallel Corpus[C]//Proceedings of International Conference on Language Resources and Evaluation.2022:6704-6710.
[7] ESPLÀ-GOMIS M,FORCADA M,RAMÍREZ-SÁNCHEZ G,et al.Paracrawl:Web-scale parallel corpora for the languages of the EU[C]//Proceedings of Machine Translation Summit.2019:118-119.
[8] COSTA-JUSSÀ M R,CROSS J,ÇELEBI O,et al.No Language Left Behind:Scaling Human-Centered Machine Translation[J].arXiv:2207.04672,2022.
[9] TUFIS D,ION R,DANIEL S,et al.Wikipedia as an SMT training corpus[C]//Proceedings of the International Conference Recent Advances in Natural Language Processing.2013:702-709.
[10] SCHWENK H,CHAUDHARY V,SUN S,et al.Wikimatrix:Mining 135m parallel sentences in 1620 language pairs from Wikipedia[C]//Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.2021:1351-1361.
[11] JOHNSON J,DOUZE M,JÉGOU H.Billion-scale similarity search with GPUs[J].IEEE Transactions on Big Data,2019,7(3):535-547.
[12] ARTETXE M,SCHWENK H.Margin-based parallel corpus mining with multilingual sentence embeddings[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:3197-3203.
[13] KVAPILÍKOVÁ I,ARTETXE M,LABAKA G,et al.Unsupervised multilingual sentence embeddings for parallel corpus mining[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:255-262.
[14] RESNIK P.Mining the Web for Bilingual Text[C]//Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.1999:527-534.
[15] BUCK C,KOEHN P.Findings of the WMT 2016 bilingual document alignment shared task[C]//Proceedings of the First Conference on Machine Translation.2016:554-563.
[16] FENG F,YANG Y,CER D,et al.Language-agnostic BERT Sentence Embedding[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.2022:878-889.
[17] KOEHN P,KHAYRALLAH H,HEAFIELD K,et al.Findings of the WMT 2018 shared task on parallel corpus filtering[C]//Proceedings of the Third Conference on Machine Translation.2018:726-739.
[18] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[19] KUDO T,RICHARDSON J.SentencePiece:A simple and language independent subword tokenizer and detokenizer for Neural Text Processing[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.2018:66-71.
[20] POST M.A Call for Clarity in Reporting BLEU Scores[C]//Proceedings of the Third Conference on Machine Translation.2018:186-191.