Computer Science ›› 2023, Vol. 50 ›› Issue (4): 141-148. doi: 10.11896/jsjkx.220100083
杨晓宇, 李超, 陈舜尧, 李浩亮, 殷光强
YANG Xiaoyu, LI Chao, CHEN Shunyao, LI Haoliang, YIN Guangqiang
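Abstract: With the continued growth of multimedia data on the Internet, text-image retrieval has become a research hotspot. Image-text retrieval methods commonly use a mutual attention mechanism, in which image and text features interact, to obtain good matching results. However, such methods cannot produce standalone image features and text features: in the late stage of large-scale retrieval the image and text features must still interact, which consumes a great deal of time and prevents fast retrieval and matching. Meanwhile, Transformer-based cross-modal image-text feature learning has achieved good results and is attracting growing attention. This paper designs a novel Transformer-based text-image retrieval network (HAS-Net) with the following main improvements: 1) a hierarchical Transformer encoding structure that makes better use of low-level syntactic information and high-level semantic information; 2) a new feature aggregation method, built on the self-attention mechanism, that improves on traditional global feature aggregation; 3) shared Transformer encoding layers that map image features and text features into a common feature encoding space. Experiments on the MS-COCO and Flickr30k datasets show that cross-modal retrieval performance improves on both, placing the method among the leading approaches of its kind and demonstrating the effectiveness of the proposed network structure.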
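The retrieval setting described in the abstract, where each image and each caption is reduced to its own standalone vector so that matching needs no late image-text interaction, can be illustrated with a minimal sketch. The abstract does not give the form of HAS-Net's aggregation, so everything here is an assumption for illustration: the scoring vector `w`, the helper names `attention_pool` and `cosine`, the feature dimension, and the random features standing in for Transformer encoder outputs. It is a generic attention-weighted pooling, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, w):
    """Aggregate token/region features into one L2-normalized vector.

    features: list of n vectors of length d (stand-ins for encoder outputs)
    w: hypothetical scoring vector of length d (assumed, not from the paper)
    """
    d = len(w)
    # One scalar relevance score per token or region.
    scores = [sum(f[i] * w[i] for i in range(d)) / math.sqrt(d) for f in features]
    weights = softmax(scores)
    # Weighted sum replaces plain mean pooling.
    pooled = [sum(a * f[i] for a, f in zip(weights, features)) for i in range(d)]
    norm = math.sqrt(sum(p * p for p in pooled))
    return [p / norm for p in pooled]


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(a * b for a, b in zip(u, v))


random.seed(0)
D = 8  # illustrative feature dimension


def random_features(n):
    """Random placeholder features; a real system would use encoder outputs."""
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(n)]


w = [random.gauss(0, 1) for _ in range(D)]

# Image embeddings can be computed once, offline, and stored.
gallery = [attention_pool(random_features(36), w) for _ in range(5)]

# At query time only the caption is encoded; matching is a dot product.
query = attention_pool(random_features(12), w)
scores = [cosine(query, g) for g in gallery]
ranking = sorted(range(len(gallery)), key=lambda i: -scores[i])
```

Because the gallery vectors do not depend on the query, ranking reduces to one dot product per stored image, which is the speed advantage the abstract attributes to having separate image and text features.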