Survey of Research Progress on Cross-modal Retrieval

doi:10.11896/jsjkx.200800165

Abstract

Abstract: With the explosive growth of multimedia data on the Internet, single-modal retrieval can no longer meet users' needs, and cross-modal retrieval has emerged in response. Cross-modal retrieval aims to use data of one modality to retrieve related data of another modality; its core tasks are extracting data features and measuring the correlation between data of different modalities. This paper reviews recent research progress in cross-modal retrieval and summarizes the results from four perspectives: traditional methods, deep learning methods, hashing methods based on hand-crafted features, and hashing methods based on deep learning. On this basis, it compares and analyzes the performance of the various algorithms on the standard datasets commonly used for cross-modal retrieval. Finally, it analyzes the open problems in cross-modal retrieval research and discusses the future development trends and applications of the field.

Key words: Cross-modal retrieval, Deep learning, Feature extraction, Correlation measure

CLC Number: TP391

FENG Xia, HU Zhi-yi, LIU Cai-hua. Survey of Research Progress on Cross-modal Retrieval[J]. Computer Science, 2021, 48(8): 13-23. https://doi.org/10.11896/jsjkx.200800165

