Computer Science (计算机科学) ›› 2024, Vol. 51 ›› Issue (1): 60-67. doi: 10.11896/jsjkx.231100024
谷仕威1, 刘静2, 李丙春2, 熊德意1
GU Shiwei1, LIU Jing2, LI Bingchun2, XIONG Deyi1
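Abstract: Unsupervised sentence alignment is an important and challenging problem in natural language processing. The task aims to find correspondences between sentences in different languages, providing basic support for applications such as cross-lingual information retrieval and machine translation. This survey summarizes the current state of research on unsupervised sentence alignment from three perspectives: methods, challenges, and applications. In terms of methods, unsupervised sentence alignment spans several families of approaches, including those based on multilingual embeddings, clustering, and self-supervised or generative models. However, it faces challenges such as diversity, language differences, and domain adaptation. Polysemy and cross-language differences complicate sentence alignment, and the difficulty is most pronounced for low-resource languages. Despite these challenges, unsupervised sentence alignment has important applications in cross-lingual information retrieval, machine translation, and multilingual information aggregation. Aligning sentences without supervision allows information in different languages to be integrated, which improves retrieval effectiveness. Research in this area also continues to drive technical innovation, creating opportunities for more accurate and robust unsupervised sentence alignment.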
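To make the embedding-based family of methods concrete, the following is a minimal sketch, not a method taken from the survey itself. It assumes that sentences from both languages have already been mapped into a shared vector space by some multilingual encoder; the toy vectors, the function names `cosine` and `mutual_nearest_pairs`, and the `threshold` value are all hypothetical and chosen for illustration only. A source and a target sentence are paired when each is the other's nearest neighbour under cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_nearest_pairs(src_vecs, tgt_vecs, threshold=0.5):
    """Pair sentences that are each other's nearest neighbour.

    src_vecs / tgt_vecs are lists of sentence embeddings that are assumed
    to live in one shared multilingual space (an assumption of this sketch).
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    if not src_vecs or not tgt_vecs:
        return []
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence, and best source for every target.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: sim[i][j])
                for i in range(len(src_vecs))]
    best_src = [max(range(len(src_vecs)), key=lambda i: sim[i][j])
                for j in range(len(tgt_vecs))]
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep the pair only if the choice is mutual and similar enough.
        if best_src[j] == i and sim[i][j] >= threshold:
            pairs.append((i, j, sim[i][j]))
    return pairs


# Toy 3-dimensional "embeddings" (hypothetical values, not real encoder output).
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.5, 0.5, 0.5]]
pairs = mutual_nearest_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The mutual-nearest-neighbour check is one simple way to discard unreliable matches. In this toy data the third source sentence stays unaligned because its best target prefers a different source sentence.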