Computer Science ›› 2021, Vol. 48 ›› Issue (8): 13-23. doi: 10.11896/jsjkx.200800165

• Database & Big Data & Data Science •

Survey of Research Progress on Cross-modal Retrieval

FENG Xia, HU Zhi-yi, LIU Cai-hua   

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China; Key Laboratory of Smart Airport Theory and System, CAAC, Tianjin 300300, China
  • Received: 2020-08-26  Revised: 2020-10-15  Published: 2021-08-10
  • About author: FENG Xia, born in 1970, Ph.D, professor, is a member of China Computer Federation. Her main research interests include intelligent information processing and artificial intelligence applications in aviation. (xfeng@cauc.edu.cn) LIU Cai-hua, born in 1987, Ph.D, lecturer. Her main research interests include computer vision and machine learning.
  • Supported by:
    Fundamental Research Funds for the Central Universities from Civil Aviation University of China (3122021052) and Natural Science Foundation of Tianjin, China (18JCYBJC85100).

Abstract: With the explosive growth of multimedia data on the Internet, single-modal retrieval can no longer meet users' needs, and cross-modal retrieval has emerged. Cross-modal retrieval aims to use data of one modality to retrieve related data of another modality; its core tasks are extracting data features and measuring the correlation between data of different modalities. This paper summarizes recent research progress in cross-modal retrieval from four perspectives: traditional methods, deep learning methods, hand-crafted-feature hashing methods, and deep learning hashing methods. On this basis, the performance of representative algorithms on commonly used standard cross-modal retrieval datasets is compared and analyzed. Finally, open problems in cross-modal retrieval research are analyzed and future development trends of the field are discussed.
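To make the core task concrete, the following minimal sketch (written for this summary, not taken from any surveyed method) illustrates common-space cross-modal retrieval in Python: modality-specific features are projected into a shared embedding space, a query from one modality is ranked against items of the other by cosine similarity, and a binarized variant ranks by Hamming distance in the way hash-coding methods do. The dimensions, the projection matrices W_img / W_txt, and the random data are hypothetical stand-ins for whatever a concrete method (e.g., CCA or a deep network) would learn from paired training data.

import numpy as np

rng = np.random.default_rng(0)

# Toy pre-extracted features: 100 images (4096-d) and 100 texts (300-d).
img_feats = rng.normal(size=(100, 4096))
txt_feats = rng.normal(size=(100, 300))

# Hypothetical learned projections into a 128-d common space (stand-ins
# for the mapping a real method would learn from paired training data).
W_img = rng.normal(size=(4096, 128))
W_txt = rng.normal(size=(300, 128))

def embed(feats, W):
    # Map modality-specific features into the shared space and L2-normalize.
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_emb = embed(img_feats, W_img)
txt_emb = embed(txt_feats, W_txt)

def retrieve(query_emb, gallery_emb, k=5):
    # Rank gallery items by cosine similarity (dot product of unit vectors).
    scores = gallery_emb @ query_emb
    return np.argsort(-scores)[:k]

# Text-to-image retrieval for the first text query.
print(retrieve(txt_emb[0], img_emb))

# Hash-coding variant: binarize the common-space embeddings and rank by
# Hamming distance, trading accuracy for compact storage and fast lookup.
img_codes = img_emb > 0
txt_codes = txt_emb > 0
hamming = np.count_nonzero(img_codes != txt_codes[0], axis=1)
print(np.argsort(hamming)[:5])

In practice the projections are trained so that paired image-text samples lie close together in the common space; the random matrices above only demonstrate the retrieval and ranking steps themselves.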

Key words: Correlation measure, Cross-modal retrieval, Deep learning, Feature extraction

CLC Number: TP391