Text-Dynamic Image Cross-modal Retrieval Algorithm Based on Progressive Prototype Matching

Computer Science ›› 2025, Vol. 52 ›› Issue (9): 276-281. doi: 10.11896/jsjkx.241200204

• Computer Graphics & Multimedia •

PENG Jiao¹, HE Yue¹, SHANG Xiaoran², HU Saier², ZHANG Bo¹, CHANG Yongjuan¹, OU Zhonghong³, LU Yanyan¹, JIANG Dan¹, LIU Yaduo¹

1 Information & Telecommunications Branch, State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050000, China
2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received: 2024-12-30 Revised: 2025-03-31 Online: 2025-09-15 Published: 2025-09-11
Corresponding author: OU Zhonghong (zhonghong.ou@bupt.edu.cn)
About author: (p2010015645@163.com)
PENG Jiao, born in 1991, postgraduate, engineer. Her main research interests include NLP, image processing and big data analysis.
OU Zhonghong, born in 1982, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No. 69730S). His main research interests include small sample learning, cross-domain adaptation and small target detection.
Supported by:
State Grid Hebei Information and Telecommunication (SGHEXT00SJJS2310134).

Abstract: In social and chat scenarios, users are no longer limited to text or emoji, and increasingly communicate with static or dynamic images, which carry richer semantics. Although existing text-dynamic image retrieval algorithms have achieved some success, they still lack fine-grained intra-modal and inter-modal interaction, and their prototype generation process lacks global guidance. To address these problems, this paper proposes a Global-aware Progressive Prototype Matching Model (GaPPMM) for text-dynamic image cross-modal retrieval. A three-stage progressive prototype matching method is used to achieve fine-grained cross-modal interaction. In addition, a global-aware temporal prototype generation method is proposed, which uses the preview features produced by the global branch as the query of an attention mechanism to guide the local branch toward the most relevant local features, thereby achieving fine-grained feature extraction for dynamic images. Experimental results show that the proposed model surpasses existing state-of-the-art (SOTA) models in terms of the sum of recall rates on a public dataset.
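
The idea of using a global preview feature as the attention query over local features can be illustrated with a minimal sketch. This is not the GaPPMM implementation: the function name `global_guided_prototype`, the single-head scaled dot-product form, the absence of learned projections, and the toy 2-dimensional features are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_guided_prototype(preview, local_feats):
    """Pool local features into one prototype, guided by a global query.

    preview:     global-branch feature used as the attention query (length d)
    local_feats: local-branch features used as keys and values (each length d)
    Returns (prototype, attention_weights).
    Hypothetical helper: learned projections are omitted for brevity.
    """
    d = len(preview)
    # Scaled dot-product score between the global query and each local feature.
    scores = [
        sum(q * k for q, k in zip(preview, feat)) / math.sqrt(d)
        for feat in local_feats
    ]
    weights = softmax(scores)
    # Weighted sum of local features: those most relevant to the query dominate.
    prototype = [
        sum(w * feat[i] for w, feat in zip(weights, local_feats))
        for i in range(d)
    ]
    return prototype, weights


# Toy example: three local (e.g. per-frame) features; the second one is the
# most aligned with the preview feature and should receive the largest weight.
preview = [1.0, 0.0]
frames = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
prototype, weights = global_guided_prototype(preview, frames)
```

In this sketch the attention weights form a probability distribution over the local features, so the resulting prototype is a convex combination that leans toward the features most similar to the global preview.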

Key words: Cross-modal retrieval, Text-dynamic image retrieval, Progressive prototype matching, Attention mechanisms, Global sensitivity analysis

CLC Number: TP391

PENG Jiao, HE Yue, SHANG Xiaoran, HU Saier, ZHANG Bo, CHANG Yongjuan, OU Zhonghong, LU Yanyan, JIANG Dan, LIU Yaduo. Text-Dynamic Image Cross-modal Retrieval Algorithm Based on Progressive Prototype Matching[J]. Computer Science, 2025, 52(9): 276-281. https://doi.org/10.11896/jsjkx.241200204
