Computer Science ›› 2023, Vol. 50 ›› Issue (4): 141-148. doi: 10.11896/jsjkx.220100083

• Computer Graphics & Multimedia •

Text-Image Cross-modal Retrieval Based on Transformer

YANG Xiaoyu, LI Chao, CHEN Shunyao, LI Haoliang, YIN Guangqiang   

  1. Center for Public Security Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  • Received: 2022-01-10  Revised: 2022-07-05  Online: 2023-04-15  Published: 2023-04-06
  • About author: YANG Xiaoyu, born in 1996, postgraduate. His main research interests include deep learning, computer vision and cross-modal retrieval.
    YIN Guangqiang, born in 1982, master, professor. His main research interests include network security, computer vision, signal processing and intelligent manufacturing.
  • Supported by:
    Shenzhen Science and Technology Program (JSGG20220301090405009).

Abstract: With the growth of multimedia data on the Internet, text-image retrieval has become a research hotspot. Many image-text retrieval methods rely on a mutual attention mechanism that interacts image and text features to achieve better matching. However, such methods cannot encode images and texts independently: every query requires a late interaction between image and text features, which consumes considerable time in large-scale retrieval and rules out fast matching. Meanwhile, Transformer-based cross-modal image-text feature learning has achieved good results and is attracting growing attention from researchers. This paper proposes HAS-Net, a novel Transformer-based text-image retrieval network with three main improvements: a hierarchical Transformer encoding structure that better exploits low-level syntactic information and high-level semantic information; a new feature aggregation method based on self-attention that improves on conventional global feature aggregation; and a shared Transformer encoding layer that maps image features and text features into a common embedding space. Experiments on the MS-COCO and Flickr30k datasets show improved cross-modal retrieval performance that leads comparable algorithms, demonstrating the effectiveness of the proposed network structure.

Key words: Transformer, Cross-modal retrieval, Hierarchical feature extraction, Feature aggregation, Feature sharing

CLC Number: TP399