Computer Science (计算机科学), 2021, Vol. 48, Issue (8): 13-23. doi: 10.11896/jsjkx.200800165

• Database & Big Data & Data Science* •

Survey of Research Progress on Cross-modal Retrieval

FENG Xia, HU Zhi-yi, LIU Cai-hua

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China; Key Laboratory of Smart Airport Theory and System, CAAC, Tianjin 300300, China
  • Received: 2020-08-26 Revised: 2020-10-15 Published: 2021-08-10
  • Corresponding author: LIU Cai-hua (chliu@cauc.edu.cn)
  • About author: FENG Xia, born in 1970, Ph.D, professor, is a member of China Computer Federation. Her main research interests include intelligent information processing and aviation applications of artificial intelligence. (xfeng@cauc.edu.cn) LIU Cai-hua, born in 1987, Ph.D, lecturer. Her main research interests include computer vision and machine learning.
  • Supported by:
    Fundamental Research Funds for the Central Universities from Civil Aviation University of China (3122021052) and Natural Science Foundation of Tianjin, China (18JCYBJC85100).

Abstract: With the explosive growth of multimedia data on the Internet, single-modal retrieval can no longer meet users' needs, and cross-modal retrieval has emerged in response. Cross-modal retrieval aims to use data of one modality to retrieve related data of another modality. Its core tasks are extracting data features and measuring the correlation between data of different modalities. This paper reviews recent research progress in cross-modal retrieval and summarizes the results in the field from four perspectives: traditional methods, deep learning methods, hash coding methods based on hand-crafted features, and hash coding methods based on deep learning. On this basis, the performance of the various algorithms on the standard datasets commonly used for cross-modal retrieval is compared and analyzed. Finally, the open problems in cross-modal retrieval research are analyzed, and the future development trends and applications of the field are discussed.

Key words: Cross-modal retrieval, Deep learning, Feature extraction, Correlation measure

CLC Number: TP391