Computer Science (计算机科学), 2025, Vol. 52, Issue (9): 276-281. doi: 10.11896/jsjkx.241200204
彭姣1, 贺月1, 商笑然2, 胡塞尔2, 张博1, 常永娟1, 欧中洪3, 卢艳艳1, 姜丹1, 刘亚铎1
PENG Jiao1, HE Yue1, SHANG Xiaoran2, HU Saier2, ZHANG Bo1, CHANG Yongjuan1, OU Zhonghong3, LU Yanyan1, JIANG Dan1, LIU Yaduo1
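Abstract: In social and chat scenarios, users are no longer limited to text or emoji; they increasingly communicate with static or animated images, which carry richer semantics. Although existing text-to-animated-image retrieval algorithms have achieved some success, they still lack fine-grained interaction within and across modalities, and their prototype generation lacks global guidance. To address these problems, this paper proposes a Global-aware Progressive Prototype Matching Model (GaPPMM) for cross-modal retrieval between text and animated images. The model uses a three-stage progressive prototype matching scheme to achieve fine-grained cross-modal interaction. It also introduces a global-aware temporal prototype generation method: the preview feature produced by the global branch serves as the query of an attention mechanism, guiding the local branch to attend to the most relevant local features and thereby extracting fine-grained features from animated images. Experimental results show that the proposed model surpasses existing SOTA models in the sum of recall rates on public datasets.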
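As a rough illustration of the global-guided attention described in the abstract, the following Python sketch uses a global preview vector as the attention query over per-frame local features. It is not the authors' implementation: the function names, the scaled dot-product scoring, and the toy inputs are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_guided_pooling(frame_feats, preview):
    """Pool local frame features using a global preview vector as the query.

    frame_feats: list of T local feature vectors, each of length d (hypothetical).
    preview:     global preview feature of length d (hypothetical).
    Returns (pooled_feature, attention_weights).
    """
    d = len(preview)
    # Assumed scoring: scaled dot product between the query and each frame.
    scores = [
        sum(f * q for f, q in zip(frame, preview)) / math.sqrt(d)
        for frame in frame_feats
    ]
    weights = softmax(scores)
    # Weighted sum of the local features, one output value per dimension.
    pooled = [
        sum(w * frame[j] for w, frame in zip(weights, frame_feats))
        for j in range(d)
    ]
    return pooled, weights


# Toy input (hypothetical): three frames, each a 3-dimensional feature.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
preview = [0.0, 0.0, 5.0]
pooled, weights = global_guided_pooling(frames, preview)
```

In this toy input, the third frame aligns with the preview vector, so it receives the largest attention weight and dominates the pooled feature.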