Computer Science ›› 2025, Vol. 52 ›› Issue (9): 276-281.doi: 10.11896/jsjkx.241200204

• Computer Graphics & Multimedia •

Text-Dynamic Image Cross-modal Retrieval Algorithm Based on Progressive Prototype Matching

PENG Jiao1, HE Yue1, SHANG Xiaoran2, HU Saier2, ZHANG Bo1, CHANG Yongjuan1, OU Zhonghong3, LU Yanyan1, JIANG Dan1, LIU Yaduo1

  1. 1 Information & Telecommunications Branch,State Grid Hebei Electric Power Co.,Ltd.,Shijiazhuang 050000,China
    2 School of Computer Science,Beijing University of Posts and Telecommunications,Beijing 100876,China
    3 State Key Laboratory of Networking and Switching Technology,Beijing University of Posts and Telecommunications,Beijing 100876,China
  • Received:2024-12-30 Revised:2025-03-31 Online:2025-09-15 Published:2025-09-11
  • About author:PENG Jiao,born in 1991,postgraduate,engineer.Her main research interests include NLP,image processing and big data analysis.
    OU Zhonghong,born in 1982,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.69730S).His main research interests include small sample learning,cross-domain adaptation and small target detection.
  • Supported by:
    State Grid Hebei Information and Telecommunication(SGHEXT00SJJS2310134).

Abstract: In social and chat scenarios, users are no longer limited to text or simple emoji, and increasingly communicate with static or dynamic images that carry richer semantics. Although existing text-dynamic image retrieval algorithms have achieved promising results, they still lack fine-grained intra-modal and inter-modal interaction, and the prototype generation process lacks global guidance. To address these problems, this paper proposes a Global-aware Progressive Prototype Matching Model (GaPPMM) for text-dynamic image cross-modal retrieval. A three-stage progressive prototype matching method is used to achieve fine-grained cross-modal interaction. In addition, a globally sensitive temporal prototype generation method is proposed: the preview features produced by the global branch serve as the query of the attention mechanism, guiding the local branch to attend to the most relevant local features and thereby enabling fine-grained feature extraction from dynamic images. Experimental results demonstrate that the proposed model surpasses state-of-the-art methods in terms of recall on the publicly available dataset.
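The globally guided attention described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the feature dimensions, and the use of a mean-pooled vector as a stand-in for the global branch's "preview" feature are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_guided_attention(frame_feats, preview_feat):
    """Use the global 'preview' feature as the attention query;
    per-frame local features serve as keys and values, yielding a
    temporal prototype weighted toward the most relevant frames."""
    d = frame_feats.shape[-1]
    scores = frame_feats @ preview_feat / np.sqrt(d)  # (T,) query-key scores
    weights = softmax(scores)                         # attention over frames
    return weights @ frame_feats                      # (d,) temporal prototype

# Toy usage: a dynamic image with 5 frames of 8-dim local features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 8))
preview = frames.mean(axis=0)  # hypothetical global-branch preview feature
proto = global_guided_attention(frames, preview)
```

Frames most similar to the global preview receive the largest attention weights, so the resulting prototype emphasizes locally relevant content while remaining anchored to the global view.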

Key words: Cross-modal retrieval, Text-dynamic image retrieval, Progressive prototype matching, Attention mechanisms, Global sensitivity analysis

CLC Number: TP391
[1]LI Y C,SONG Y,CAO L L,et al.TGIF:A New Dataset and Benchmark on Animated GIF Description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2016:4641-4650.
[2]SHMUELI B,RAY S,KU L W.Happy dance,slow clap:Using reaction GIFs to predict induced affect on Twitter[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.Stroudsburg,PA:ACL,2021:395-401.
[3]CHEN H,DING G,LIU X,et al.IMRAM:Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2020:12655-12663.
[4]ZHANG Q,LEI Z,ZHANG Z,et al.Context-Aware Attention Network for Image-Text Retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2020:3536-3545.
[5]ZHENG F,LI W,WANG X,et al.A Cross-Attention Mechanism Based on Regional-Level Semantic Features of Images for Cross-Modal Text-Image Retrieval in Remote Sensing[J].Applied Sciences,2022,12(23):12221.
[6]SONG Y,SOLEYMANI M.Polysemous visual-semantic embedding for cross-modal retrieval[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway,NJ:IEEE,2019:1979-1988.
[7]WANG X,JURGENS D.An animated picture says at least a thousand words:selecting gif-based replies in multimodal dialog[C]//Findings of the Association for Computational Linguistics:EMNLP 2021.Stroudsburg,PA:ACL,2021:3228-3257.
[8]LI G,DUAN N,FANG Y,et al.Unicoder-vl:A universal en-coder for vision and language by cross-modal pre-training[C]//Proceedings of the AAAI Conference on Artificial Intelligence.New York:AAAI,2020:11336-11344.
[9]CONNEAU A,LAMPLE G.Cross-lingual Language Model Pretraining[C]//NeurIPS:Advances in Neural Information Processing Systems.Curran Associates Inc.,2019.
[10]HUANG H,LIANG Y,DUAN N,et al.Unicoder:A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks[J].arXiv:1909.00964,2019.
[11]ZHANG K,MAO Z,WANG Q,et al.Negative-Aware Attention Framework for Image-Text Matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2022:15661-15670.
[12]LI X,YIN X,LI C,et al.Oscar:Object-Semantics Aligned Pre-training for Vision-Language Tasks[C]//Proceedings of 16th European Conference on Computer Vision(ECCV 2020).Springer,2020:121-137.
[13]CHEN S,ZHAO Y,JIN Q,et al.Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2020:10638-10647.
[14]SONG X,CHEN J,WU Z,et al.Spatial-Temporal Graphs for Cross-Modal Text2Video Retrieval[J].IEEE Transactions on Multimedia,2022,24:2914-2923.
[15]MIECH A,ZHUKOV D,ALAYRAC J B,et al.HowTo100M:Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.IEEE,2019:2630-2640.
[16]LUO J,LI Y,PAN Y,et al.CoCo-BERT:Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising[C]//Proceedings of the 29th ACM International Conference on Multimedia.New York:ACM,2021:5600-5608.
[17]PENG J,HUANG J,XIONG P,et al.Video-Text as Game Players:Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2023:2472-2482.
[18]DONG J F,ZHANG M,ZHANG Z,et al.Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.IEEE,2023:11302-11312.