Computer Science ›› 2024, Vol. 51 ›› Issue (1): 60-67. doi: 10.11896/jsjkx.231100024

• Special Topic for the 50th Anniversary of the Journal •

Survey of Unsupervised Sentence Alignment

GU Shiwei1, LIU Jing2, LI Bingchun2, XIONG Deyi1

  1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  2 School of Computer Science and Technology, Kashi University, Kashgar, Xinjiang 844000, China
  • Received: 2023-11-02 Revised: 2023-12-10 Online: 2024-01-15 Published: 2024-01-12
  • Corresponding author: XIONG Deyi (dyxiong@tju.edu.cn)
  • Author contact: swgu98@qq.com
  • About author: GU Shiwei, born in 1998, postgraduate. His main research interests include natural language processing and machine translation.
    XIONG Deyi, born in 1979, Ph.D., professor, Ph.D. supervisor. His main research interests include natural language processing and machine translation.
  • Supported by:
    Key Project of the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01D43), Key Research and Development Program of Yunnan Province (202203AA080004), and Research on the Chinese-Urdu Parallel Corpus (KS2022084).

Abstract: Unsupervised sentence alignment is an important and challenging problem in natural language processing. The task aims to find corresponding sentences across different languages, providing basic support for applications such as cross-lingual information retrieval and machine translation. This survey summarizes the current state of research on unsupervised sentence alignment from three aspects: methods, challenges, and applications. In terms of methods, unsupervised sentence alignment covers a variety of approaches, including those based on multilingual embeddings, clustering, and self-supervised or generative models. However, it faces challenges such as diversity, language differences, and domain adaptation. The ambiguity of language and the differences between languages complicate sentence alignment, especially for low-resource languages. Despite these challenges, unsupervised sentence alignment has important applications in cross-lingual information retrieval, machine translation, and multilingual information aggregation. Aligning sentences without supervision allows information in different languages to be integrated, which improves retrieval effectiveness. At the same time, research in this field continues to drive technical innovation, creating opportunities for more accurate and robust unsupervised sentence alignment.

Key words: Unsupervised sentence alignment, Natural language processing, Machine translation, Self-supervised, Low-resource

CLC Number:

  • TP391