Computer Science (计算机科学), 2021, Vol. 48, Issue (8): 13-23. doi: 10.11896/jsjkx.200800165
冯霞, 胡志毅, 刘才华
FENG Xia, HU Zhi-yi, LIU Cai-hua
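Abstract: With the explosive growth of multimedia data on the Internet, single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has emerged in response. Cross-modal retrieval uses data of one modality to retrieve relevant data of another modality; its core tasks are feature extraction and measuring the correlation between data of different modalities. This paper reviews recent research progress in cross-modal retrieval and summarizes the field's results from four perspectives: traditional methods, deep learning methods, hashing methods based on hand-crafted features, and hashing methods based on deep learning. On this basis, it compares the performance of the various algorithms on the standard datasets commonly used for cross-modal retrieval. Finally, it analyzes the open problems in cross-modal retrieval research and discusses the field's future trends and applications.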
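The two core tasks named in the abstract, feature extraction and cross-modal correlation measurement, can be illustrated with a minimal sketch. It assumes that features have already been extracted and projected into a shared space, uses cosine similarity as the correlation measure, and uses sign binarization with Hamming distance as a stand-in for the hashing family. The vectors, item names, and the functions `cosine`, `retrieve`, `to_hash`, and `hamming` are all hypothetical and are not taken from the surveyed methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, gallery, top_k=2):
    """Rank gallery items (id -> vector) by similarity to the query."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:top_k]]

def to_hash(vec):
    """Binarize a real-valued vector by sign (an illustrative hash, not a learned one)."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical embeddings: a text query and three images in an assumed shared space.
text_query = [0.9, 0.1, -0.2]
image_gallery = {
    "img_cat": [0.8, 0.2, -0.1],
    "img_car": [-0.7, 0.6, 0.3],
    "img_dog": [0.5, -0.4, -0.6],
}

# Real-valued retrieval: rank images by cosine similarity to the text query.
ranked = retrieve(text_query, image_gallery)

# Hash-based retrieval: compare compact binary codes by Hamming distance.
query_code = to_hash(text_query)
distances = {k: hamming(query_code, to_hash(v)) for k, v in image_gallery.items()}
```

In this toy setting both routes place `img_cat` closest to the text query. The hashing route trades some precision for compact codes and cheap distance computation, which is the usual motivation for hashing in large-scale retrieval.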