Computer Science ›› 2024, Vol. 51 ›› Issue (1): 60-67.doi: 10.11896/jsjkx.231100024

• Special Issue on the 56th Anniversary of Computer Science •

Survey of Unsupervised Sentence Alignment

GU Shiwei1, LIU Jing2, LI Bingchun2, XIONG Deyi1   

  1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
    2 School of Computer Science and Technology, Kashi University, Kashgar, Xinjiang 844000, China
  • Received:2023-11-02 Revised:2023-12-10 Online:2024-01-15 Published:2024-01-12
  • About author:GU Shiwei,born in 1998,postgraduate.His main research interests include natural language processing and machine translation.
    XIONG Deyi,born in 1979,Ph.D, professor,Ph.D supervisor.His main research interests include natural language processing and machine translation.
  • Supported by:
    Natural Science Foundation of Xinjiang Uygur Autonomous Region(2022D01D43), Key Research and Development Program of Yunnan Province(202203AA080004) and Research on the Parallel Corpus of Chinese Urdu Language(KS2022084).

Abstract: Unsupervised sentence alignment is an important and challenging problem in natural language processing. The task aims to identify corresponding sentences across different languages, providing basic support for cross-lingual information retrieval, machine translation, and other applications. This survey summarizes the current state of research on unsupervised sentence alignment from three aspects: methods, challenges, and applications. In terms of methods, unsupervised sentence alignment covers a variety of approaches, including those based on multilingual embeddings, clustering, and self-supervised or generative models. However, unsupervised sentence alignment faces challenges such as linguistic diversity, cross-language differences, and domain adaptation. The ambiguity and diversity of languages complicate sentence alignment, especially for low-resource languages. Despite these challenges, unsupervised sentence alignment has important applications in fields such as cross-lingual information retrieval, machine translation, and multilingual information aggregation. Through unsupervised sentence alignment, information in different languages can be integrated to improve the effectiveness of information retrieval. At the same time, research in this field is constantly driving technological innovation and development, creating opportunities for more accurate and robust unsupervised sentence alignment.
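The embedding-based methods summarized above typically score candidate sentence pairs in a shared multilingual embedding space. The sketch below illustrates, in minimal NumPy, the ratio-margin scoring idea used in margin-based parallel corpus mining (in the spirit of references [10] and [39]); it is a toy illustration, not an implementation from any specific system, and the random vectors stand in for real multilingual sentence embeddings.

```python
import numpy as np

def cosine_sim_matrix(X, Y):
    # Row-normalize both embedding matrices; the dot product then
    # gives pairwise cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def margin_scores(X, Y, k=2):
    # Ratio-margin scoring: each pair's cosine similarity is divided by
    # the mean similarity of its k nearest neighbours in both directions,
    # penalizing "hub" sentences that are close to everything.
    sim = cosine_sim_matrix(X, Y)
    knn_x = np.mean(np.sort(sim, axis=1)[:, -k:], axis=1)  # per source row
    knn_y = np.mean(np.sort(sim, axis=0)[-k:, :], axis=0)  # per target column
    margin = (knn_x[:, None] + knn_y[None, :]) / 2.0
    return sim / margin

def align(X, Y, k=2):
    # Greedy alignment: each source sentence is paired with the target
    # sentence that maximizes the margin score.
    return np.argmax(margin_scores(X, Y, k), axis=1)
```

With orthogonal toy embeddings `X = np.eye(3)` and a permuted target matrix `Y = X[[2, 0, 1]]`, `align(X, Y)` recovers the permutation `[1, 2, 0]`. Real systems replace the toy vectors with multilingual sentence encoders and filter pairs below a score threshold rather than aligning greedily.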

Key words: Unsupervised sentence alignment, Natural language processing, Machine translation, Self-supervised, Low-resource

CLC Number: TP391
[1]BRAUNE F,FRASER A.Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora[C]//Coling 2010:Posters.2010:81-89.
[2]LI Z,HUANG S,ZHANG Z,et al.Dual-Alignment Pre-training for Cross-lingual Sentence Embedding[J].arXiv:2305.09148,2023.
[3]TIEN C,STEINERT-THRELKELD S.Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2022:8696-8706.
[4]KEUNG P,SALAZAR J,LU Y,et al.Unsupervised bitext mining and translation via self-trained contextual embeddings[J].Transactions of the Association for Computational Linguistics,2021,8:828-841.
[5]ZHU S,MI C,LI T,et al.Unsupervised parallel sentences of machine translation for Asian language pairs[J].ACM Transactions on Asian and Low-Resource Language Information Processing,2023,22(3):64:1-64:14.
[6]LAMPLE G,CONNEAU A,DENOYER L,et al.Unsupervised Machine Translation Using Monolingual Corpora Only[C]//International Conference on Learning Representations.2018.
[7]ARTETXE M,LABAKA G,AGIRRE E,et al.Unsupervised neural machine translation[C]//6th International Conference on Learning Representations(ICLR 2018).2018.
[8]LAMPLE G,CONNEAU A,RANZATO M A,et al.Word translation without parallel data[C]//International Conference on Learning Representations.2018.
[9]QI Y,SACHAN D,FELIX M,et al.When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,Volume 2(Short Papers).2018:529-535.
[10]ARTETXE M,SCHWENK H.Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond[J].Transactions of the Association for Computational Linguistics,2019,7:597-610.
[11]REN S,LIU S,ZHOU M,et al.A graph-based coarse-to-fine method for unsupervised bilingual lexicon induction[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:3476-3485.
[12]ARTETXE M,LABAKA G,AGIRRE E.A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2018:789-798.
[13]GARNEAU N,GODBOUT M,BEAUCHEMIN D,et al.A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings:Making the Method Robustly Reproducible as Well[C]//Proceedings of the Twelfth Language Resources and Evaluation Conference.2020:5546-5554.
[14]CONNEAU A,KHANDELWAL K,GOYAL N,et al.Unsupervised Cross-lingual Representation Learning at Scale[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:8440-8451.
[15]LU X,QIANG J,LI Y,et al.An unsupervised method for building sentence simplification corpora in multiple languages[C]//Findings of the Association for Computational Linguistics.Punta Cana:Association for Computational Linguistics,2021:227-237.
[16]KVAPILÍKOVÁ I,ARTETXE M,LABAKA G,et al.Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics:Student Research Workshop.2020:255-262.
[17]ARTETXE M,LABAKA G,AGIRRE E.A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2018:789-798.
[18]HASHIMOTO K,XIONG C,TSURUOKA Y,et al.A Joint Many-Task Model:Growing a Neural Network for Multiple NLP Tasks[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:1923-1933.
[19]ORMAZABAL A,ARTETXE M,LABAKA G,et al.Analyzing the Limitations of Cross-lingual Word Embedding Mappings[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:4990-4995.
[20]PATRA B,MONIZ J R A,GARG S,et al.Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:184-193.
[21]ZHAO X,WANG Z,ZHANG Y,et al.A Relaxed Matching Procedure for Unsupervised BLI[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:3036-3041.
[22]HANGYA V,BRAUNE F,KALASOUSKAYA Y,et al.Unsupervised parallel sentence extraction from comparable corpora[C]//Proceedings of the 15th International Conference on Spoken Language Translation.Brussels:International Conference on Spoken Language Translation.2018:7-13.
[23]KIM Y,ROSENDAHL H,ROSSENBACH N,et al.Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron[C]//Proceedings of the 4th Workshop on Representation Learning for NLP(RepL4NLP-2019).2019:61-71.
[24]BAÑÓN M,CHEN P,HADDOW B,et al.ParaCrawl:Web-scale acquisition of parallel corpora[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:4555-4567.
[25]HANGYA V,FRASER A.Unsupervised parallel sentence extraction with parallel segment detection helps machine translation[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:1224-1234.
[26]HONG C,LEE J,LEE J.Unsupervised Interlingual Semantic Representations from Sentence Embeddings for Zero-Shot Cross-Lingual Transfer[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:7944-7951.
[27]SCHWENK H,WENZEK G,EDUNOV S,et al.CCMatrix:Mining Billions of High-Quality Parallel Sentences on the Web[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(Volume 1:Long Papers).2021:6490-6500.
[28]SCHWENK H,CHAUDHARY V,SUN S,et al.WikiMatrix:Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia[C]//Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:Main Volume.2021:1351-1361.
[29]LIAN X,JAIN K,TRUSZKOWSKI J,et al.Unsupervised multilingual alignment using Wasserstein barycenter[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence.2021:3702-3708.
[30]CHOUSA K,NAGATA M,NISHINO M.SpanAlign:Sentence alignment method based on cross-language span prediction and ILP[C]//Proceedings of the 28th International Conference on Computational Linguistics.2020:4750-4761.
[31]ZHU S,GU S,LI S,et al.Mining parallel sentences from internet with multi-view knowledge distillation for low-resource language pairs[J/OL].Knowledge and Information Systems,2023.https://doi.org/10.1007/s10115-023-01925-3.
[32]CHI T C,CHEN Y N.CLUSE:Cross-Lingual Unsupervised Sense Embeddings[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.2018:271-281.
[33]WANG L,ZHAO W,LIU J.Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.2021:3807-3815.
[34]DENG J,WAN F,YANG T,et al.Clustering-Aware Negative Sampling for Unsupervised Sentence Representation[J].arXiv:2305.09892,2023.
[35]PAETZOLD G,ALVA-MANCHEGO F,SPECIA L.Massalign:Alignment and annotation of comparable documents[C]//Proceedings of the IJCNLP 2017,System Demonstrations.2017:1-4.
[36]LENG Y,TAN X,QIN T,et al.Unsupervised Pivot Translation for Distant Languages[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:175-183.
[37]CHEN S,ZHOU J,SUN Y,et al.An Information Minimization Based Contrastive Learning Model for Unsupervised Sentence Embeddings Learning[C]//Proceedings of the 29th International Conference on Computational Linguistics.2022:4821-4831.
[38]PIRES T,SCHLINGER E,GARRETTE D.How Multilingual is Multilingual BERT?[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:4996-5001.
[39]ARTETXE M,SCHWENK H.Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:3197-3203.
[40]DING Y,LI J,GONG Z,et al.Improving neural sentence alignment with word translation[J].Frontiers of Computer Science,2021,15:151302.
[41]WU N,LIANG Y,REN H,et al.Unsupervised context aware sentence representation pretraining for multi-lingual dense retrieval[J].arXiv:2206.03281,2022.
[42]LIU J,MORIN E,SALDARRIAGA S P,et al.From unified phrase representation to bilingual phrase alignment in an unsupervised manner[J].Natural Language Engineering,2023,29(3):643-668.
[43]ZWEIGENBAUM P,SHAROFF S,RAPP R.Towards preparation of the second BUCC shared task:Detecting parallel sentences in comparable corpora[C]//Proceedings of the Ninth Workshop on Building and Using Comparable Corpora.Portoroz,Slovenia:European Language Resources Association(ELRA),2016:38-43.