Computer Science ›› 2022, Vol. 49 ›› Issue (11): 134-140. doi: 10.11896/jsjkx.220600010

• Computer Graphics & Multimedia •

  • Corresponding author: SONG Jing-kuan (jingkuan.song@gmail.com)
  • About author: MIAO Lan-xin (miaolanxin.lisa@gmail.com)

Granularity-aware and Semantic Aggregation Based Image-Text Retrieval Network

MIAO Lan-xin1, LEI Yu1, ZENG Peng-peng1, LI Xiao-yu2, SONG Jing-kuan1   

  1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
  2 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Received: 2022-05-31  Revised: 2022-08-02  Online: 2022-11-15  Published: 2022-11-03
  • About author: MIAO Lan-xin, born in 1998, postgraduate. Her main research interests include cross-modal retrieval, computer vision and machine learning.
    SONG Jing-kuan, born in 1986, Ph.D, professor, associate editor. His main research interests include large-scale multimedia retrieval, image/video segmentation and image/video understanding using hashing, graph learning, and deep learning techniques.
  • Supported by:
    National Natural Science Foundation of China (62122018, 61872064).



Abstract: Image-text retrieval is a fundamental task in the visual-language domain, which aims at mining the relationships between different modalities. However, most existing approaches rely heavily on associating specific regions of an image with each word of similar semantics in a sentence, and underestimate the significance of multi-granularity information in images, resulting in irrelevant matches between the two modalities and semantically ambiguous embeddings. Generally, an image contains object-level, action-level, relationship-level or even scene-level information that is not explicitly labeled. Therefore, it is challenging to align such complex visual information with ambiguous textual descriptions. To tackle this issue, this paper proposes a granularity-aware and semantic aggregation (GASA) network to obtain multi-granularity visual representations and narrow the cross-modal gap. Specifically, the granularity-aware feature selection module mines rich multi-granularity information of images and conducts a multi-scale fusion, guided by an adaptive gated fusion mechanism and a pyramid dilated-convolution structure. The semantic aggregation module clusters the multi-granularity information from visual and textual clues in a shared space to obtain residual-based local representations. Experiments are conducted on two benchmark datasets, and the results show that the proposed model outperforms the state-of-the-art methods by over 2% on R@1 of MSCOCO 1k and by 4.1% on R@Sum of Flickr30K.
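The two modules described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`gated_fuse`, `semantic_aggregate`), the squared-distance soft assignment, and the sharpness parameter `alpha` are assumptions for illustration only. `gated_fuse` blends features from two pyramid scales through a sigmoid gate, and `semantic_aggregate` performs a NetVLAD-style clustering of local features against shared centers, accumulating residuals so that image and text descriptors become comparable in one space:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fuse(coarse, fine, w, b):
    """Adaptive gated fusion (sketch): a per-dimension sigmoid gate
    decides how much of each scale to keep.
    coarse, fine: (n, d) features from two pyramid scales
    w: (2d, d) gate weights, b: (d,) gate bias (hypothetical parameters)
    """
    gate = sigmoid(np.concatenate([coarse, fine], axis=1) @ w + b)  # (n, d)
    return gate * coarse + (1.0 - gate) * fine

def semantic_aggregate(features, centers, alpha=10.0):
    """NetVLAD-style semantic aggregation (sketch): soft-assign each
    local feature to shared semantic centers and sum the residuals
    (feature - center); using the same centers for both modalities
    places their descriptors in a shared space.
    features: (n, d) local features; centers: (k, d) shared centers
    returns: (k*d,) L2-normalized aggregated descriptor
    """
    dist2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
    assign = softmax(-alpha * dist2, axis=1)                             # soft assignment
    resid = features[:, None, :] - centers[None, :, :]                   # (n, k, d)
    vlad = (assign[..., None] * resid).sum(axis=0)                       # (k, d)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12          # intra-normalize
    v = vlad.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)
```

With both modalities aggregated against the same centers, the retrieval score reduces to the dot product (cosine similarity) of the two unit-norm descriptors.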

Key words: Image-text matching, Cross-modal retrieval, Feature extraction, Semantic aggregation, Multi-granularity information extraction

CLC Number: TP391