Computer Science ›› 2024, Vol. 51 ›› Issue (1): 60-67.doi: 10.11896/jsjkx.231100024

• Special Issue on the 56th Anniversary of Computer Science •

Survey of Unsupervised Sentence Alignment

GU Shiwei1, LIU Jing2, LI Bingchun2, XIONG Deyi1   

  1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
    2 School of Computer Science and Technology, Kashi University, Kashgar, Xinjiang 844000, China
  • Received:2023-11-02 Revised:2023-12-10 Online:2024-01-15 Published:2024-01-12
  • About author:GU Shiwei,born in 1998,postgraduate.His main research interests include natural language processing and machine translation.
    XIONG Deyi,born in 1979,Ph.D, professor,Ph.D supervisor.His main research interests include natural language processing and machine translation.
  • Supported by:
    Natural Science Foundation of Xinjiang Uygur Autonomous Region(2022D01D43), Key Research and Development Program of Yunnan Province(202203AA080004) and Research on the Parallel Corpus of Chinese Urdu Language(KS2022084).

Abstract: Unsupervised sentence alignment is an important and challenging problem in natural language processing. The task aims to identify corresponding sentences across different languages, providing basic support for cross-lingual information retrieval, machine translation, and other applications. This survey summarizes the current state of research on unsupervised sentence alignment from three aspects: methods, challenges, and applications. In terms of methods, unsupervised sentence alignment covers a variety of approaches, including those based on multilingual embeddings, clustering, and self-supervised or generative models. However, unsupervised sentence alignment faces challenges such as linguistic diversity, cross-language differences, and domain adaptation. The ambiguity and diversity of languages complicate sentence alignment, especially for low-resource languages. Despite these challenges, unsupervised sentence alignment has important applications in fields such as cross-lingual information retrieval, machine translation, and multilingual information aggregation. Through unsupervised sentence alignment, information in different languages can be integrated to improve the effectiveness of information retrieval. At the same time, research in this field is constantly driving technological innovation and development, creating opportunities for more accurate and robust unsupervised sentence alignment.
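The embedding-based methods summarized above typically score candidate sentence pairs in a shared multilingual embedding space. The sketch below illustrates, in minimal NumPy, the ratio-margin scoring idea used in margin-based parallel corpus mining (in the spirit of references [10] and [39]); it is a toy illustration, not an implementation from any specific system, and the random vectors stand in for real multilingual sentence embeddings.

```python
import numpy as np

def cosine_sim_matrix(X, Y):
    # Row-normalize both embedding matrices; the dot product then
    # gives pairwise cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def margin_scores(X, Y, k=2):
    # Ratio-margin scoring: each pair's cosine similarity is divided by
    # the mean similarity of its k nearest neighbours in both directions,
    # penalizing "hub" sentences that are close to everything.
    sim = cosine_sim_matrix(X, Y)
    knn_x = np.mean(np.sort(sim, axis=1)[:, -k:], axis=1)  # per source row
    knn_y = np.mean(np.sort(sim, axis=0)[-k:, :], axis=0)  # per target column
    margin = (knn_x[:, None] + knn_y[None, :]) / 2.0
    return sim / margin

def align(X, Y, k=2):
    # Greedy alignment: each source sentence is paired with the target
    # sentence that maximizes the margin score.
    return np.argmax(margin_scores(X, Y, k), axis=1)
```

With orthogonal toy embeddings `X = np.eye(3)` and a permuted target matrix `Y = X[[2, 0, 1]]`, `align(X, Y)` recovers the permutation `[1, 2, 0]`. Real systems replace the toy vectors with multilingual sentence encoders and filter pairs below a score threshold rather than aligning greedily.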

Key words: Unsupervised sentence alignment, Natural language processing, Machine translation, Self-supervised, Low-resource

CLC Number: TP391
[1]BRAUNE F,FRASER A.Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora[C]//Coling 2010:Posters.2010:81-89.
[2]LI Z,HUANG S,ZHANG Z,et al.Dual-Alignment Pre-training for Cross-lingual Sentence Embedding[J].arXiv:2305.09148,2023.
[3]TIEN C,STEINERT-THRELKELD S.Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2022:8696-8706.
[4]KEUNG P,SALAZAR J,LU Y,et al.Unsupervised bitext mining and translation via self-trained contextual embeddings[J].Transactions of the Association for Computational Linguistics,2021,8:828-841.
[5]ZHU S,MI C,LI T,et al.Unsupervised parallel sentences of machine translation for Asian language pairs[J].ACM Transactions on Asian and Low-Resource Language Information Processing,2023,22(3):64:1-64:14.
[6]LAMPLE G,CONNEAU A,DENOYER L,et al.Unsupervised Machine Translation Using Monolingual Corpora Only[C]//International Conference on Learning Representations.2018.
[7]ARTETXE M,LABAKA G,AGIRRE E,et al.Unsupervised neural machine translation[C]//6th International Conference on Learning Representations(ICLR 2018).2018.
[8]LAMPLE G,CONNEAU A,RANZATO M A,et al.Word translation without parallel data[C]//International Conference on Learning Representations.2018.
[9]QI Y,SACHAN D,FELIX M,et al.When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,Volume 2(Short Papers).2018:529-535.
[10]ARTETXE M,SCHWENK H.Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond[J].Transactions of the Association for Computational Linguistics,2019,7:597-610.
[11]REN S,LIU S,ZHOU M,et al.A graph-based coarse-to-fine method for unsupervised bilingual lexicon induction[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:3476-3485.
[12]ARTETXE M,LABAKA G,AGIRRE E.A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2018:789-798.
[13]GARNEAU N,GODBOUT M,BEAUCHEMIN D,et al.A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings:Making the Method Robustly Reproducible as Well[C]//Proceedings of the Twelfth Language Resources and Evaluation Conference.2020:5546-5554.
[14]CONNEAU A,KHANDELWAL K,GOYAL N,et al.Unsupervised Cross-lingual Representation Learning at Scale[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:8440-8451.
[15]LU X,QIANG J,LI Y,et al.An unsupervised method for building sentence simplification corpora in multiple languages[C]//Findings of the Association for Computational Linguistics.Punta Cana:Association for Computational Linguistics,2021:227-237.
[16]KVAPILÍKOVÁ I,ARTETXE M,LABAKA G,et al.Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics:Student Research Workshop.2020:255-262.
[17]ARTETXE M,LABAKA G,AGIRRE E.A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).2018:789-798.
[18]HASHIMOTO K,XIONG C,TSURUOKA Y,et al.A Joint Many-Task Model:Growing a Neural Network for Multiple NLP Tasks[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:1923-1933.
[19]ORMAZABAL A,ARTETXE M,LABAKA G,et al.Analyzing the Limitations of Cross-lingual Word Embedding Mappings[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:4990-4995.
[20]PATRA B,MONIZ J R A,GARG S,et al.Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:184-193.
[21]ZHAO X,WANG Z,ZHANG Y,et al.A Relaxed Matching Procedure for Unsupervised BLI[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:3036-3041.
[22]HANGYA V,BRAUNE F,KALASOUSKAYA Y,et al.Unsupervised parallel sentence extraction from comparable corpora[C]//Proceedings of the 15th International Conference on Spoken Language Translation.Brussels:International Conference on Spoken Language Translation.2018:7-13.
[23]KIM Y,ROSENDAHL H,ROSSENBACH N,et al.Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron[C]//Proceedings of the 4th Workshop on Representation Learning for NLP(RepL4NLP-2019).2019:61-71.
[24]BAÑÓN M,CHEN P,HADDOW B,et al.ParaCrawl:Web-scale acquisition of parallel corpora[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:4555-4567.
[25]HANGYA V,FRASER A.Unsupervised parallel sentence extraction with parallel segment detection helps machine translation[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:1224-1234.
[26]HONG C,LEE J,LEE J.Unsupervised Interlingual Semantic Representations from Sentence Embeddings for Zero-Shot Cross-Lingual Transfer[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:7944-7951.
[27]SCHWENK H,WENZEK G,EDUNOV S,et al.CCMatrix:Mining Billions of High-Quality Parallel Sentences on the Web[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(Volume 1:Long Papers).2021:6490-6500.
[28]SCHWENK H,CHAUDHARY V,SUN S,et al.WikiMatrix:Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia[C]//Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:Main Volume.2021:1351-1361.
[29]LIAN X,JAIN K,TRUSZKOWSKI J,et al.Unsupervised multilingual alignment using Wasserstein barycenter[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence.2021:3702-3708.
[30]CHOUSA K,NAGATA M,NISHINO M.SpanAlign:Sentence alignment method based on cross-language span prediction and ILP[C]//Proceedings of the 28th International Conference on Computational Linguistics.2020:4750-4761.
[31]ZHU S,GU S,LI S,et al.Mining parallel sentences from internet with multi-view knowledge distillation for low-resource language pairs[J/OL].Knowledge and Information Systems,2023.https://doi.org/10.1007/s10115-023-01925-3.
[32]CHI T C,CHEN Y N.CLUSE:Cross-Lingual Unsupervised Sense Embeddings[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.2018:271-281.
[33]WANG L,ZHAO W,LIU J.Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.2021:3807-3815.
[34]DENG J,WAN F,YANG T,et al.Clustering-Aware Negative Sampling for Unsupervised Sentence Representation[J].arXiv:2305.09892,2023.
[35]PAETZOLD G,ALVA-MANCHEGO F,SPECIA L.Massalign:Alignment and annotation of comparable documents[C]//Proceedings of the IJCNLP 2017,System Demonstrations.2017:1-4.
[36]LENG Y,TAN X,QIN T,et al.Unsupervised Pivot Translation for Distant Languages[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:175-183.
[37]CHEN S,ZHOU J,SUN Y,et al.An Information Minimization Based Contrastive Learning Model for Unsupervised Sentence Embeddings Learning[C]//Proceedings of the 29th International Conference on Computational Linguistics.2022:4821-4831.
[38]PIRES T,SCHLINGER E,GARRETTE D.How Multilingual is Multilingual BERT?[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:4996-5001.
[39]ARTETXE M,SCHWENK H.Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:3197-3203.
[40]DING Y,LI J,GONG Z,et al.Improving neural sentence alignment with word translation[J].Frontiers of Computer Science,2021,15:151302.
[41]WU N,LIANG Y,REN H,et al.Unsupervised context aware sentence representation pretraining for multi-lingual dense retrieval[J].arXiv:2206.03281,2022.
[42]LIU J,MORIN E,SALDARRIAGA S P,et al.From unified phrase representation to bilingual phrase alignment in an unsupervised manner[J].Natural Language Engineering,2023,29(3):643-668.
[43]ZWEIGENBAUM P,SHAROFF S,RAPP R.Towards preparation of the second BUCC shared task:Detecting parallel sentences in comparable corpora[C]//Proceedings of the Ninth Workshop on Building and Using Comparable Corpora.Portoroz,Slovenia:European Language Resources Association(ELRA),2016:38-43.