Computer Science ›› 2022, Vol. 49 ›› Issue (11): 134-140.doi: 10.11896/jsjkx.220600010

• Computer Graphics & Multimedia •

Granularity-aware and Semantic Aggregation Based Image-Text Retrieval Network

MIAO Lan-xin1, LEI Yu1, ZENG Peng-peng1, LI Xiao-yu2, SONG Jing-kuan1   

  1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
    2 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Received:2022-05-31 Revised:2022-08-02 Online:2022-11-15 Published:2022-11-03
  • About author: MIAO Lan-xin, born in 1998, postgraduate. Her main research interests include cross-modal retrieval, computer vision and machine learning.
    SONG Jing-kuan, born in 1986, Ph.D, professor, associate editor. His main research interests include large-scale multimedia retrieval, image/video segmentation and image/video understanding using hashing, graph learning, and deep learning techniques.
  • Supported by:
    National Natural Science Foundation of China(62122018,61872064).

Abstract: Image-text retrieval is a fundamental task in the vision-language domain that aims to mine the relationships between different modalities. However, most existing approaches rely heavily on associating specific regions of an image with semantically similar words in a sentence and underestimate the significance of multi-granularity information in images, resulting in irrelevant matches between the two modalities and semantically ambiguous embeddings. In general, an image contains object-level, action-level, relationship-level, or even scene-level information that is not explicitly labeled, so aligning such complex visual information with ambiguous descriptions is challenging. To tackle this issue, this paper proposes a granularity-aware and semantic aggregation (GASA) network to obtain multi-granularity visual representations and narrow the cross-modal gap. Specifically, a granularity-aware feature selection module selects rich multi-granularity information from images and performs multi-scale fusion, guided by an adaptive gated fusion mechanism and a pyramid structure. A semantic aggregation module then clusters the multi-granularity visual and textual cues in a shared space to obtain residual representations. Experiments on two benchmark datasets show that the model outperforms the state of the art by more than 2% on R@1 of MSCOCO 1K, and by 4.1% on R@Sum of Flickr30k.
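The two mechanisms the abstract names, an adaptive gated fusion of multi-granularity features and a cluster-based semantic aggregation that produces residual representations, can be sketched in NumPy. This is an illustrative assumption of how such modules are commonly built (the gate is a sigmoid over concatenated features; the aggregation follows the NetVLAD-style soft assignment to shared clusters), not the authors' implementation; all shapes, names, and weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 64, 8  # feature dimension, number of shared-space clusters

# Hypothetical projection weights for the gate (learned in a real model).
W_gate = rng.standard_normal((2 * D, D)) * 0.01

def gated_fusion(fine, coarse):
    """Adaptive gated fusion: a sigmoid gate decides, per dimension,
    how much fine-grained vs. coarse-grained signal to keep."""
    z = np.concatenate([fine, coarse], axis=-1) @ W_gate   # (N, D)
    gate = 1.0 / (1.0 + np.exp(-z))                        # sigmoid in [0, 1]
    return gate * fine + (1.0 - gate) * coarse

def vlad_aggregate(features, centers):
    """NetVLAD-style semantic aggregation: softly assign each local
    feature to K clusters and accumulate its residual to each center."""
    logits = features @ centers.T                          # (N, K) similarities
    assign = np.exp(logits - logits.max(axis=1, keepdims=True))
    assign /= assign.sum(axis=1, keepdims=True)            # soft assignment
    residuals = features[:, None, :] - centers[None, :, :]  # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)     # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    return vlad.ravel()                                    # (K * D,) embedding

fine = rng.standard_normal((36, D))     # e.g. 36 region-level features
coarse = rng.standard_normal((36, D))   # e.g. coarser grid-level features
centers = rng.standard_normal((K, D))   # clusters shared by both modalities

fused = gated_fusion(fine, coarse)
embedding = vlad_aggregate(fused, centers)
print(embedding.shape)  # (512,)
```

Running the same aggregation over word features with the same shared centers yields a text embedding of identical shape, so the two modalities can be compared directly, e.g. by cosine similarity, which is the usual way such residual representations narrow the cross-modal gap.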

Key words: Image-text matching, Cross-modal retrieval, Feature extraction, Semantic aggregation, Multi-granularity information extraction

CLC Number: TP391