Computer Science ›› 2023, Vol. 50 ›› Issue (4): 141-148. doi: 10.11896/jsjkx.220100083

• Computer Graphics & Multimedia •


Text-Image Cross-modal Retrieval Based on Transformer

YANG Xiaoyu, LI Chao, CHEN Shunyao, LI Haoliang, YIN Guangqiang   

  1. Center for Public Security Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  • Received: 2022-01-10 Revised: 2022-07-05 Online: 2023-04-15 Published: 2023-04-06
  • Corresponding author: YIN Guangqiang (yingq@uestc.edu.cn)
  • About author: YANG Xiaoyu, born in 1996, postgraduate (yangxy@std.uestc.edu.cn). His main research interests include deep learning, computer vision and cross-modal retrieval.
    YIN Guangqiang, born in 1982, master, professor. His main research interests include network security, computer vision, signal processing and intelligent manufacturing.
  • Supported by:
    Shenzhen Science and Technology Program (JSGG20220301090405009).


Abstract: With the continuing growth of multimedia data on the Internet, text-image retrieval has become a research hotspot. In image-text retrieval, a mutual attention mechanism is commonly used, interacting image and text features to achieve better image-text matching results. However, this approach cannot produce standalone image features and text features; the image-text interaction must be computed at the late stage of large-scale retrieval, which consumes a great deal of time and rules out fast retrieval and matching. Meanwhile, Transformer-based cross-modal image-text feature learning has achieved good results and is receiving more and more attention from researchers. This paper designs a novel Transformer-based text-image retrieval network structure (HAS-Net), which mainly makes the following improvements: 1) a hierarchical Transformer coding structure is designed to better exploit low-level syntactic information and high-level semantic information; 2) the traditional global feature aggregation method is improved, with the self-attention mechanism used to design a new feature aggregation method; 3) by sharing Transformer coding layers, image features and text features are mapped into a common feature coding space. Experiments on the MS-COCO and Flickr30k datasets show that cross-modal retrieval performance is improved and that HAS-Net leads similar algorithms, demonstrating the effectiveness of the designed network structure.
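
To make improvements 2) and 3) concrete, below is a minimal PyTorch sketch of self-attention-based feature aggregation and of a weight-shared Transformer coding layer. It is an illustrative reading of the abstract, not the authors' released implementation: the module names (AttentionPool, SharedCrossModalEncoder), the layer depths, and the 512-dimensional embedding size are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    # Aggregates a variable-length token sequence into one global vector by
    # letting a single learned query attend over all tokens, instead of the
    # traditional mean/max pooling.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens, pad_mask=None):          # tokens: (B, L, dim)
        q = self.query.expand(tokens.size(0), -1, -1)  # one query per sample
        pooled, _ = self.attn(q, tokens, tokens, key_padding_mask=pad_mask)
        return pooled.squeeze(1)                       # (B, dim)

class SharedCrossModalEncoder(nn.Module):
    # Modality-specific Transformer layers followed by one coding layer whose
    # weights are shared by both modalities, pushing image and text features
    # into a common embedding space.
    def __init__(self, dim=512, depth=2):              # depth is hypothetical
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.image_layers = nn.TransformerEncoder(make(), num_layers=depth)
        self.text_layers = nn.TransformerEncoder(make(), num_layers=depth)
        self.shared_layer = make()                     # shared across modalities
        self.pool = AttentionPool(dim)

    def embed(self, tokens, modality_layers):
        h = modality_layers(tokens)
        h = self.shared_layer(h)                       # same weights for both
        return F.normalize(self.pool(h), dim=-1)       # unit-norm embedding

    def forward(self, image_tokens, text_tokens):
        return (self.embed(image_tokens, self.image_layers),
                self.embed(text_tokens, self.text_layers))

# Usage: region features (e.g. from a detector) and word features are encoded
# independently; similarity is then a single matrix multiplication.
model = SharedCrossModalEncoder()
img_emb, txt_emb = model(torch.randn(4, 36, 512), torch.randn(4, 20, 512))
similarity = img_emb @ txt_emb.t()                     # (4, 4) cosine scores

Because the image and text embeddings are computed without any cross-modal interaction, image vectors can be indexed offline and each text query is matched with one matrix multiplication, which is exactly the advantage over mutual-attention models at large scale that the abstract points out.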

Key words: Transformer, Cross-modal retrieval, Hierarchical feature extraction, Feature aggregation, Feature sharing

CLC Number: TP399