Computer Science ›› 2025, Vol. 52 ›› Issue (8): 259-267. doi: 10.11896/jsjkx.241000055
LI Junwen1, SONG Yuqiu2, ZHANG Weiyan2, RUAN Tong2, LIU Jingping2, ZHU Yan1
Abstract: Cross-lingual information retrieval (CLIR) is an important information-acquisition task in natural language processing. Recently, retrieval methods based on large language models (LLMs) have attracted wide attention and achieved notable progress on this task. However, existing unsupervised retrieval methods that prompt LLMs still fall short in both effectiveness and efficiency. To address this, a novel CLIR method based on aligned queries is proposed. Specifically, following the pre-train-then-fine-tune paradigm, an adaptive self-guided encoder is built on a pretrained multilingual model, using retrieval learning within a single language to guide cross-lingual retrieval learning. The method introduces semantically aligned queries in the same language as the documents, and designs an adaptive self-guidance mechanism that exploits the probability distributions of monolingual retrieval results, viewed from different language perspectives, to guide cross-lingual retrieval. Extensive experiments on 22 language pairs evaluate the effectiveness and efficiency of the proposed model, and the results show that its MRR reaches the current state of the art. Specifically, it improves the average MRR over the second-best baseline by 15.45% on high-resource language pairs and by 18.9% on low-resource language pairs. Moreover, compared with LLM-based methods, it requires less training and inference time and converges significantly faster. The code has been released.
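The self-guidance mechanism described above can be read as a distillation setup: the probability distribution over candidate documents produced by monolingual retrieval (query and document in the same language) serves as a soft target for the cross-lingual retrieval distribution. A minimal sketch of such a loss is shown below; the function name, the use of KL divergence, and the temperature parameter are illustrative assumptions, not the paper's exact formulation.

```python
import math

def _softmax(scores, t=1.0):
    # Numerically stable softmax over a list of retrieval scores.
    m = max(scores)
    exps = [math.exp((s - m) / t) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def self_guided_kl(cross_scores, mono_scores, t=1.0):
    """Illustrative self-guidance loss: KL(teacher || student), where the
    teacher is the monolingual score distribution and the student is the
    cross-lingual score distribution over the same candidate documents."""
    teacher = _softmax(mono_scores, t)
    student = _softmax(cross_scores, t)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student))
```

When the two distributions agree the loss is zero, so gradient pressure is applied only where the cross-lingual view disagrees with the monolingual one.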
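The reported results use MRR (mean reciprocal rank), which averages, over all queries, the reciprocal of the rank at which the first relevant document appears. A standard computation, assuming one relevant document per query, looks like this:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """ranked_lists: one ranked list of doc ids per query;
    relevant: the single relevant doc id for each query."""
    total = 0.0
    for docs, rel in zip(ranked_lists, relevant):
        for rank, doc in enumerate(docs, start=1):
            if doc == rel:
                total += 1.0 / rank  # reciprocal rank of first hit
                break                # unretrieved relevant docs contribute 0
    return total / len(ranked_lists)
```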