Computer Science ›› 2025, Vol. 52 ›› Issue (8): 259-267. doi: 10.11896/jsjkx.241000055

• Artificial Intelligence •

Cross-lingual Information Retrieval Based on Aligned Query

LI Junwen1, SONG Yuqiu2, ZHANG Weiyan2, RUAN Tong2, LIU Jingping2, ZHU Yan1   

  1. School of Mathematics, East China University of Science and Technology, Shanghai 200237, China
    2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2024-10-12  Revised: 2025-01-25  Online: 2025-08-15  Published: 2025-08-08
  • About author: LI Junwen, born in 2001, postgraduate. His main research interests include natural language processing and information retrieval.
    ZHU Yan, born in 1984, Ph.D, associate professor. Her main research interest is graph theory and its applications.

Abstract: Cross-lingual Information Retrieval (CLIR) is an important information acquisition task in natural language processing. Recently, LLM-based retrieval methods have gained attention and demonstrated remarkable progress on this task. However, existing unsupervised retrieval methods based on prompting large language models are still insufficient in both effectiveness and efficiency. To solve this problem, this paper introduces a novel CLIR method based on aligned queries. Specifically, it adopts the "pretrain-finetune" paradigm and proposes an adaptive self-teaching encoder built on a pretrained multilingual model, in which monolingual retrieval learning guides cross-lingual retrieval learning. The method introduces semantically aligned queries in the same language as the documents and designs an adaptive self-teaching mechanism that guides cross-lingual retrieval by leveraging the probability distributions of monolingual retrieval results from different linguistic perspectives. To evaluate its effectiveness and efficiency, this paper conducts extensive experiments on 22 language pairs. The results demonstrate that the proposed method achieves state-of-the-art performance in terms of MRR. In particular, it improves average MRR over the second-best baseline by 15.45% on high-resource language pairs and by 18.9% on low-resource language pairs. Furthermore, the method reduces training and inference time compared with LLM-based approaches and exhibits faster convergence with better stability.
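The following is a minimal sketch (in PyTorch) of the self-teaching idea the abstract describes: the retrieval distribution produced by the aligned monolingual query serves as a teacher for the cross-lingual query over the same documents. The dot-product similarity, in-batch negatives, temperature `tau`, and function names are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def retrieval_log_probs(queries, docs, tau=0.05):
        # Softmax over in-batch query-document similarities;
        # tau is a hypothetical temperature, in-batch docs act as candidates.
        sims = queries @ docs.T / tau                      # [B, B]
        return F.log_softmax(sims, dim=-1)

    def self_teaching_loss(q_cross, q_aligned, docs, tau=0.05):
        # Student: cross-lingual query vs. documents.
        log_p_student = retrieval_log_probs(q_cross, docs, tau)
        # Teacher: aligned monolingual query vs. the same documents
        # (detached, so only the cross-lingual side is taught here).
        with torch.no_grad():
            log_p_teacher = retrieval_log_probs(q_aligned, docs, tau)
        # Standard in-batch contrastive term: diagonal pairs are positives.
        labels = torch.arange(q_cross.size(0), device=q_cross.device)
        ce = F.nll_loss(log_p_student, labels)
        # Self-teaching term: pull the cross-lingual retrieval distribution
        # toward the monolingual one, i.e. KL(teacher || student).
        kl = F.kl_div(log_p_student, log_p_teacher,
                      log_target=True, reduction="batchmean")
        return ce + kl

In the paper's design such a loss would presumably be computed at several encoder layers and combined via the adaptive layer-wise coefficients named in the keywords; the sketch above shows only a single-layer variant.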

Key words: Cross-lingual information retrieval, Aligned query, Self-teaching, Adaptive layer-wise coefficient
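The keyword "adaptive layer-wise coefficient" suggests that per-layer self-teaching signals are weighted adaptively rather than uniformly. Below is a generic, hedged sketch of one common way to realize such weighting: learnable logits normalized with a softmax over the layers. The class name and uniform initialization are assumptions; the paper's exact mechanism may differ.

    import torch
    import torch.nn as nn

    class AdaptiveLayerwiseCoefficients(nn.Module):
        # Learnable coefficients over per-layer losses; the softmax keeps them
        # positive and summing to one, and zero-init starts all layers equal.
        def __init__(self, num_layers):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(num_layers))

        def forward(self, per_layer_losses):
            # per_layer_losses: tensor of shape [num_layers]
            weights = torch.softmax(self.logits, dim=0)
            return (weights * per_layer_losses).sum()

Combined with the single-layer loss sketched after the abstract, each encoder layer would contribute one self-teaching loss, and these coefficients would adaptively weight the layers' contributions during fine-tuning.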

CLC Number: TP391