Computer Science ›› 2025, Vol. 52 ›› Issue (8): 259-267. doi: 10.11896/jsjkx.241000055

• Artificial Intelligence •

  • Corresponding author: ZHU Yan (zhuygraph@ecust.edu.cn)
  • About author: (18602126280@163.com)

Cross-lingual Information Retrieval Based on Aligned Query

LI Junwen1, SONG Yuqiu2, ZHANG Weiyan2, RUAN Tong2, LIU Jingping2, ZHU Yan1   

  1. School of Mathematics, East China University of Science and Technology, Shanghai 200237, China
    2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2024-10-12  Revised: 2025-01-25  Online: 2025-08-15  Published: 2025-08-08
  • About author: LI Junwen, born in 2001, postgraduate. His main research interests include natural language processing and information retrieval.
    ZHU Yan, born in 1984, Ph.D., associate professor. Her main research interest is graph theory and its applications.

Abstract: Cross-lingual Information Retrieval (CLIR) is an important information acquisition task in natural language processing. Recently, retrieval methods based on large language models (LLMs) have gained wide attention and demonstrated remarkable progress in this task. However, existing unsupervised retrieval methods based on prompting LLMs are still insufficient in both effectiveness and efficiency. To address this problem, this paper introduces a novel CLIR method based on aligned queries. Specifically, it adopts the "pretrain-finetune" paradigm and proposes an adaptive self-teaching encoder built on a pretrained multilingual model, in which monolingual retrieval learning guides cross-lingual retrieval learning. The method introduces semantically aligned queries in the same language as the documents and designs an adaptive self-teaching mechanism that guides cross-lingual retrieval with the probability distributions of monolingual retrieval results viewed from different languages. To evaluate the effectiveness and efficiency of the method, extensive experiments are conducted on 22 language pairs. The results demonstrate that the proposed method achieves state-of-the-art performance in terms of MRR. In particular, it improves average MRR over the sub-optimal baseline by 15.45% on high-resource language pairs and by 18.9% on low-resource language pairs. Furthermore, the method requires less training and inference time than LLM-based approaches and exhibits faster convergence with enhanced stability. The code is publicly available.

Key words: cross-lingual information retrieval, aligned query, self-teaching, adaptive layer-wise coefficient
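The self-teaching idea described in the abstract can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' implementation: the retrieval score distribution produced by the aligned monolingual query (same language as the documents) acts as a teacher for the distribution produced by the original cross-lingual query, via a KL-divergence objective. The function names and the plain KL formulation are assumptions for illustration; the paper's adaptive layer-wise coefficients are not reproduced here.

```python
import math

def softmax(scores):
    """Turn raw query-document retrieval scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def self_teaching_loss(cross_scores, mono_scores):
    """KL(teacher || student): the monolingual (aligned-query) retrieval
    distribution teaches the cross-lingual one.
    Hypothetical sketch; the paper's actual loss and its adaptive
    layer-wise weighting are not reproduced here."""
    teacher = softmax(mono_scores)   # aligned query vs. candidate documents
    student = softmax(cross_scores)  # original query vs. candidate documents
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))
```

When the cross-lingual scores match the monolingual ones the loss goes to zero; during training, an encoder would be updated to minimize this term jointly with a standard retrieval loss.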

CLC number: TP391

References
[1]HUANG Z,YU P,ALLAN J.Improving cross-lingual information retrieval on low-resource languages via optimal transport distillation[C]//Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.2023:1048-1056.
[2]ZHANG S,LIANG Y,GONG M,et al.Modeling sequential sentence relation to improve cross-lingual dense retrieval[J].arXiv:2302.01626,2023.
[3]LI Z J,LI S H.Survey on Web-based Question Answering [J].Computer Science,2017,44(6):1-7,42.
[4]YU Y Y,CHAO W H,HE Y Y,et al.Cross-language Knowledge Linkage Based on Bilingual Topic Model and Bilingual Word Vectors[J].Computer Science,2019,46(1):238-244.
[5]WANG Y,REN R,LI J,et al.REAR:A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering[J].arXiv:2402.17497,2024.
[6]ZHUANG H,QIN Z,HUI K,et al.Beyond yes and no:Improving zero-shot llm rankers via scoring fine-grained relevance labels[J].arXiv:2310.14122,2023.
[7]SUN W,YAN L,MA X,et al.Is ChatGPT good at search? Investigating large language models as re-ranking agents[J].arXiv:2304.09542,2023.
[8]QIN Z,JAGERMAN R,HUI K,et al.Large language models are effective text rankers with pairwise ranking prompting[J].arXiv:2306.17563,2023.
[9]ACHIAM J,ADLER S,AGARWAL S,et al.Gpt-4 technical report[J].arXiv:2303.08774,2023.
[10]TOUVRON H,LAVRIL T,IZACARD G,et al.Llama:Open and efficient foundation language models[J].arXiv:2302.13971,2023.
[11]TOUVRON H,MARTIN L,STONE K,et al.Llama 2:Open foundation and fine-tuned chat models[J].arXiv:2307.09288,2023.
[12]LIU J P,SU J S,HUANG D G.Incorporating Language-specific Adapter into Multilingual Neural Machine Translation[J].Computer Science,2022,49(1):17-23.
[13]ELAYEB B,ROMDHANE W B,SAOUD N B B.Towards a new possibilistic query translation tool for cross-language information retrieval[J].Multimedia Tools and Applications,2018,77:2423-2465.
[14]AZARBONYAD H,SHAKERY A,FAILI H.A learning to rank approach for cross-language information retrieval exploiting multiple translation resources[J].Natural Language Engineering,2019,25(3):363-384.
[15]KISHIDA K,KANDO N.Two-stage refinement of query translation in a pivot language approach to cross-lingual information retrieval:An experiment at CLEF 2003[C]//Workshop of the Cross-Language Evaluation Forum for European Languages.Berlin:Springer,2003:253-262.
[16]TASHU T M,KONTOS E R,SABATELLI M,et al.Mapping Transformer Leveraged Embeddings for Cross-Lingual Document Representation[J].arXiv:2401.06583,2024.
[17]LIN J A,BAO C Z,DONG J F,et al.Multilingual Text-Video Cross-Modal Retrieval Model via Multilingual-Visual Common Space Learning[J].Journal of Computer Science,2024,47(9):2195-2210.
[18]ZOU A,HAO W N,JIN D W,et al.Study on Text Retrieval Based on Pre-training and Deep Hash [J].Computer Science,2021,48(11):300-306.
[19]QIU X,WANG Y,SHI J,et al.Cross-Lingual Transfer for Natural Language Inference via Multilingual Prompt Translator[J].arXiv:2403.12407,2024.
[20]CONNEAU A,KHANDELWAL K,GOYAL N,et al.Unsupervised cross-lingual representation learning at scale[J].arXiv:1911.02116,2019.
[21]LUO F,WANG W,LIU J,et al.VECO:Variable and flexible cross-lingual pre-training for language understanding and generation[J].arXiv:2010.16046,2020.
[22]LITSCHKO R,VULIĆ I,PONZETTO S P,et al.On cross-lingual retrieval with multilingual text encoders[J].Information Retrieval Journal,2022,25(2):149-183.
[23]YANG E,NAIR S,CHANDRADEVAN R,et al.C3:Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.2022:2507-2512.
[24]ZHENG H,ZHANG X,CHI Z,et al.Cross-lingual phrase retrieval[J].arXiv:2204.08887,2022.
[25]MA X,ZHANG X,PRADEEP R,et al.Zero-shot listwise document reranking with a large language model[J].arXiv:2305.02156,2023.
[26]CHEN X T,YE J J,ZU C,et al.Robustness of GPT Large Language Models on Natural Language Processing Tasks [J].Journal of Computer Research and Development,2024,61(5):1128-1142.
[27]IZACARD G,CARON M,HOSSEINI L,et al.Unsupervised dense information retrieval with contrastive learning[J].arXiv:2112.09118,2021.
[28]QU Y,DING Y,LIU J,et al.RocketQA:An optimized training approach to dense passage retrieval for open-domain question answering[J].arXiv:2010.08191,2020.
[29]XIONG L,XIONG C,LI Y,et al.Approximate nearest neighbor negative contrastive learning for dense text retrieval[J].arXiv:2007.00808,2020.
[30]HUANG Z H,YANG S Z,LIN W,et al.Knowledge Distillation:A Survey [J].Journal of Computer Science,2022,45(3):624-653.
[31]JAWAHAR G,SAGOT B,SEDDAH D.What does BERT learn about the structure of language?[C]//57th Annual Meeting of the Association for Computational Linguistics.2019.
[32]SUN S,DUH K.CLIRMatrix:A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing(EMNLP).2020:4160-4170.
[33]PENEDO G,MALARTIC Q,HESSLOW D,et al.The RefinedWeb dataset for Falcon LLM:outperforming curated corpora with web data,and web data only[J].arXiv:2306.01116,2023.
[34]ZHENG L,CHIANG W L,SHENG Y,et al.Judging LLM-as-a-judge with MT-bench and chatbot arena[J].arXiv:2306.05685,2024.
[35]LIU J,SONG Y,XUE K,et al.Fl-tuning:Layer tuning for feed-forward network in transformer[J].arXiv:2206.15312,2022.