Computer Science (计算机科学), 2022, Vol. 49, Issue 11: 134-140. doi: 10.11896/jsjkx.220600010
缪岚芯1, 雷雨1, 曾鹏鹏1, 李晓瑜2, 宋井宽1
MIAO Lan-xin1, LEI Yu1, ZENG Peng-peng1, LI Xiao-yu2, SONG Jing-kuan1
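Abstract: Image-text retrieval is a fundamental task in the vision-language field. It aims to mine the relationships between samples of different modalities, that is, to use a sample from one modality to retrieve samples with similar semantics from the other modality. However, most existing methods rely heavily on associating specific image regions with semantically similar words in a sentence, and they underestimate the importance of multi-granularity visual information, which leads to mismatches and semantically ambiguous embeddings. An image typically contains coarse- and fine-grained information at the object, action, relation, and scene levels, but this information carries no explicit multi-granularity labels and is therefore hard to put into direct one-to-one correspondence with ambiguous textual expressions. To address this problem, a Granularity-Aware and Semantic Aggregation (GASA) network is proposed to obtain multi-granularity visual features and narrow the semantic gap between text and vision. Specifically, the granularity-aware feature selection module mines multi-granularity visual information and performs multi-scale fusion under the guidance of an adaptive gated fusion mechanism and a pyramid dilated convolution structure. The semantic aggregation module clusters the multi-granularity information from vision and text in a shared space to obtain local representations. Experiments on two benchmark datasets show that the model outperforms the state of the art by more than 2% in R@1 on MSCOCO 1k, and outperforms the previous state of the art by 4.1% in R@Sum on Flickr30K.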
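The R@1 and R@Sum figures quoted in the abstract are recall-based retrieval metrics. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes a square similarity matrix in which query i is paired with exactly one ground-truth candidate i (real benchmarks such as MSCOCO pair each image with several captions), and it assumes R@Sum is the sum of R@1, R@5, and R@10 over both retrieval directions. The function names `recall_at_k` and `r_sum` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose ground-truth item ranks in the top k.

    Assumption: sim[i][j] scores query i against candidate j, and the
    ground-truth match for query i is candidate i (one-to-one pairing).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def r_sum(sim: Sequence[Sequence[float]], ks: Sequence[int] = (1, 5, 10)) -> float:
    """Sum of R@K over both directions (assumed definition of R@Sum)."""
    # Transposing the matrix swaps the roles of queries and candidates.
    transposed = [list(col) for col in zip(*sim)]
    return sum(recall_at_k(sim, k) + recall_at_k(transposed, k) for k in ks)


# Toy 3x3 similarity matrix (rows: images, columns: captions); values are made up.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.7, 0.6],
]

print(recall_at_k(toy_sim, 1))
print(r_sum(toy_sim))
```

In this toy matrix only the first image ranks its own caption first, so R@1 is one third of the queries, while every true match falls within the top two.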