Dual Gating-Residual Feature Fusion for Image-Text Cross-modal Retrieval

doi:10.11896/jsjkx.220700030

Abstract

With the rapid development of the Internet and social media, cross-modal retrieval has attracted extensive attention. Cross-modal retrieval aims to enable flexible retrieval across different modalities. Because of the heterogeneity gap between modalities, the similarity of features from different modalities cannot be computed directly, which makes it difficult to improve the accuracy of cross-modal retrieval. To narrow the heterogeneity gap between images and text, this paper proposes an image-text cross-modal retrieval method based on dual gating-residual feature fusion (DGRFF). By designing gating features and residual features to fuse the features of the image modality and the text modality, the method obtains more effective feature information from the opposite modality, making the semantic feature information more comprehensive. At the same time, an adversarial loss is adopted to align the feature distributions of the two modalities, so as to maintain the modality invariance of the fused features and obtain a more discriminative feature representation in the common latent space. Finally, the model is trained by combining a label prediction loss, a cross-modal similarity loss, and the adversarial loss. Experiments on the Wikipedia and Pascal Sentence datasets show that DGRFF performs well on cross-modal retrieval tasks.
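The abstract does not give the fusion formulas, so the following is only a minimal sketch of what a gating-plus-residual fusion step might look like, not the authors' implementation. It assumes that the gating feature is the modality's own feature scaled by a sigmoid gate computed from both modalities, that the residual feature is a tanh-projected mix of both modalities, and that the two are summed with weights `alpha` and `beta`. The function names, the random weights, and the feature values are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def gated_residual_fuse(own, other, w_gate, w_res, alpha=1.0, beta=1.0):
    """Fuse `own` with information from `other` (assumed form, see lead-in)."""
    joint = own + other  # concatenate the two feature vectors
    # Gating feature: own feature scaled element-wise by a learned gate.
    gate = [sigmoid(z) for z in linear(w_gate, joint)]
    gating_feature = [g * x for g, x in zip(gate, own)]
    # Residual feature: a correction computed from both modalities.
    residual_feature = [math.tanh(z) for z in linear(w_res, joint)]
    return [alpha * a + beta * b
            for a, b in zip(gating_feature, residual_feature)]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical, untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
image_feat = [0.2, -0.1, 0.7, 0.4]   # toy image feature
text_feat = [0.5, 0.3, -0.2, 0.1]    # toy text feature

# "Dual": one fusion per direction, each with its own parameters.
img_gate, img_res = random_matrix(dim, 2 * dim, rng), random_matrix(dim, 2 * dim, rng)
txt_gate, txt_res = random_matrix(dim, 2 * dim, rng), random_matrix(dim, 2 * dim, rng)

fused_image = gated_residual_fuse(image_feat, text_feat, img_gate, img_res)
fused_text = gated_residual_fuse(text_feat, image_feat, txt_gate, txt_res)
print(fused_image)
print(fused_text)
```

In a trained model the weight matrices would be learned jointly with the label prediction, cross-modal similarity, and adversarial losses named in the abstract; here they are random placeholders, so the printed values carry no meaning beyond showing the data flow.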

Key words: Cross-modal retrieval, Heterogeneity gap, Gating features, Residual features, Feature fusion

CLC Number: TP391

ZHANG Changfan, MA Yuanyuan, LIU Jianhua, HE Jing. Dual Gating-Residual Feature Fusion for Image-Text Cross-modal Retrieval [J]. Computer Science, 2023, 50(6A): 220700030-7.


