Computer Science, 2023, Vol. 50, Issue 6A: 220700030-7. DOI: 10.11896/jsjkx.220700030
ZHANG Changfan1, MA Yuanyuan1, LIU Jianhua2, HE Jing1
Abstract: With the rapid development of the Internet and social media, cross-modal retrieval has attracted wide attention. The goal of cross-modal retrieval learning is to enable flexible retrieval across different modalities. A heterogeneity gap exists between data of different modalities, so the similarity of features from different modalities cannot be computed directly, which makes it difficult to improve the accuracy of cross-modal retrieval tasks. To narrow the heterogeneity gap between image and text data, this paper proposes a dual gated-residual feature fusion method for cross-modal image-text retrieval (DGRFF). By designing gated features and residual features to fuse the features of the image and text modalities, the method obtains more effective feature information from the opposite modality, making the semantic feature information more comprehensive. Meanwhile, an adversarial loss is adopted to align the feature distributions of the two modalities, preserving the modality invariance of the fused features and yielding more discriminative feature representations in the common latent space. Finally, the model is trained jointly with a label prediction loss, a cross-modal similarity loss, and the adversarial loss. Experimental results on the Wikipedia and Pascal Sentence datasets demonstrate that DGRFF achieves good performance on cross-modal retrieval tasks.
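The abstract describes, but does not formalize, the gated-residual fusion block or the three-term training objective. The following is a minimal PyTorch sketch of one plausible reading: the module name GatedResidualFusion, the feature dimension, and the loss weights alpha and beta are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """One plausible gated-residual fusion block (illustrative sketch;
    the exact DGRFF formulation is not given in the abstract)."""
    def __init__(self, dim=1024):
        super().__init__()
        # Gate decides, per dimension, how much cross-modal information to admit.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Residual branch projects the opposite modality's features.
        self.residual = nn.Linear(dim, dim)

    def forward(self, own, other):
        # own:   features of this modality,        shape (batch, dim)
        # other: features of the opposite modality, shape (batch, dim)
        g = self.gate(torch.cat([own, other], dim=-1))  # gated feature
        return own + g * self.residual(other)           # residual fusion

def joint_loss(label_loss, sim_loss, adv_loss, alpha=1.0, beta=0.1):
    """Joint objective: label prediction + cross-modal similarity +
    adversarial alignment; alpha and beta are hypothetical weights."""
    return label_loss + alpha * sim_loss + beta * adv_loss

# Usage sketch: fuse image features with text features from the same batch.
img = torch.randn(8, 1024)   # e.g. image features from a CNN backbone
txt = torch.randn(8, 1024)   # e.g. text features from a text encoder
fuse = GatedResidualFusion(dim=1024)
img_fused = fuse(img, txt)   # image representation enriched by text
```

In this reading, the gate lets each modality selectively absorb information from the other while the residual connection preserves its own semantics, which is one common way to realize the "more effective feature information from the opposite modality" that the abstract claims.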