Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220700030-7. doi: 10.11896/jsjkx.220700030

• Image Processing & Multimedia Technology •


Dual Gating-Residual Feature Fusion for Image-Text Cross-modal Retrieval

ZHANG Changfan1, MA Yuanyuan1, LIU Jianhua2, HE Jing1   

  1. School of Electrical and Information Engineering, Hunan University of Technology, Zhuzhou, Hunan 412007, China;
  2. School of Rail Transit, Hunan University of Technology, Zhuzhou, Hunan 412007, China
  • Online: 2023-06-10  Published: 2023-06-12
  • Corresponding author: HE Jing (hejing@263.net)
  • About author: ZHANG Changfan, born in 1960, Ph.D, professor, Ph.D supervisor. His main research interests include fault diagnosis of electrical machines and industrial process control (zhangchangfan@263.net). HE Jing, born in 1971, Ph.D, professor, master supervisor. Her main research interests include fault diagnosis of mechatronic machines and industrial process control.
  • Supported by:
    National Natural Science Foundation of China (52172403, 62173137, 52272347) and Natural Science Foundation of Hunan Province, China (2021JJ50001, 2021JJ30217).

Abstract: With the rapid development of the Internet and social media, cross-modal retrieval has attracted extensive attention. Its purpose is to enable flexible retrieval across different modalities. Because of the heterogeneity gap between modalities, the similarity of features from different modalities cannot be computed directly, which makes it difficult to improve the accuracy of cross-modal retrieval. To narrow the heterogeneity gap between images and text, this paper proposes a dual gating-residual feature fusion method for image-text cross-modal retrieval (DGRFF). By designing gating features and residual features to fuse the features of the image and text modalities, the method obtains more effective feature information from the opposite modality, making the semantic feature information more comprehensive. At the same time, an adversarial loss is adopted to align the feature distributions of the two modalities, so that the fused features remain modality-invariant and a more discriminative feature representation is obtained in the common latent space. Finally, the model is trained with a combination of label prediction loss, cross-modal similarity loss and adversarial loss. Experiments on the Wikipedia and Pascal Sentence datasets show that DGRFF performs well on cross-modal retrieval tasks.
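
To make the fusion and the joint objective described in the abstract concrete, the sketch below gives one possible PyTorch reading of a dual gating-residual fusion module, a modality discriminator, and the combined label-prediction, cross-modal-similarity and adversarial loss. It is only an illustration under assumptions: every name, dimension and equation here (DualGatingResidualFusion, ModalityDiscriminator, img_dim, common_dim, the specific gate and residual formulas, and the simplified adversarial term) is hypothetical and may differ from the exact DGRFF formulation in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DualGatingResidualFusion(nn.Module):
    """Fuses an image feature and a text feature through a gated path plus a residual path
    (a hypothetical reading of the fusion described in the abstract)."""

    def __init__(self, img_dim: int, txt_dim: int, common_dim: int):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)          # project image features
        self.txt_proj = nn.Linear(txt_dim, common_dim)          # project text features
        self.img_gate = nn.Linear(2 * common_dim, common_dim)   # gate conditioned on both modalities
        self.txt_gate = nn.Linear(2 * common_dim, common_dim)
        self.img_res = nn.Linear(2 * common_dim, common_dim)    # residual path carrying joint information
        self.txt_res = nn.Linear(2 * common_dim, common_dim)

    def forward(self, img, txt):
        v = F.relu(self.img_proj(img))
        t = F.relu(self.txt_proj(txt))
        pair = torch.cat([v, t], dim=-1)
        # Gating features: each modality keeps only what the jointly conditioned gate lets through.
        v_gated = torch.sigmoid(self.img_gate(pair)) * v
        t_gated = torch.sigmoid(self.txt_gate(pair)) * t
        # Residual features: information from the opposite modality is added back to each branch.
        return v_gated + self.img_res(pair), t_gated + self.txt_res(pair)


class ModalityDiscriminator(nn.Module):
    """Tries to tell which modality a fused feature came from (used by the adversarial loss)."""

    def __init__(self, common_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(common_dim, common_dim // 2), nn.ReLU(),
            nn.Linear(common_dim // 2, 2),
        )

    def forward(self, x):
        return self.net(x)


def joint_loss(v, t, labels, label_head, discriminator):
    """Label prediction loss + cross-modal similarity loss + (simplified) adversarial loss."""
    pred_loss = F.cross_entropy(label_head(v), labels) + F.cross_entropy(label_head(t), labels)
    sim_loss = (1.0 - F.cosine_similarity(v, t, dim=-1)).mean()  # pull matched image-text pairs together
    mod_logits = torch.cat([discriminator(v), discriminator(t)], dim=0)
    mod_labels = torch.cat([torch.zeros(v.size(0)), torch.ones(t.size(0))]).long()
    # Simplified adversarial term: the feature extractor is rewarded when the discriminator
    # cannot separate modalities (in practice a gradient-reversal layer or alternating
    # generator/discriminator updates would normally be used instead of a plain sign flip).
    adv_loss = -F.cross_entropy(mod_logits, mod_labels)
    return pred_loss + sim_loss + adv_loss


# Toy usage with made-up dimensions (e.g. 4096-d image features, 300-d text features,
# and 10 semantic categories as in the Wikipedia dataset).
fusion = DualGatingResidualFusion(img_dim=4096, txt_dim=300, common_dim=1024)
label_head = nn.Linear(1024, 10)
disc = ModalityDiscriminator(1024)
v_fused, t_fused = fusion(torch.randn(8, 4096), torch.randn(8, 300))
loss = joint_loss(v_fused, t_fused, torch.randint(0, 10, (8,)), label_head, disc)
loss.backward()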

Key words: Cross-modal retrieval, Heterogeneity gap, Gating features, Residual features, Feature fusion

CLC Number: TP391