Computer Science (计算机科学), 2020, Vol. 47, Issue (4): 54-59. doi: 10.11896/jsjkx.190600181
邓一姣, 张凤荔, 陈学勤, 艾擎, 余苏喆
DENG Yi-jiao, ZHANG Feng-li, CHEN Xue-qin, AI Qing, YU Su-zhe
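Abstract: With the rapid growth of multimodal network data such as images, text, audio, and video, demand for diverse forms of retrieval keeps rising, and cross-modal retrieval in particular has attracted wide attention. However, because of the heterogeneity gap between modalities, finding content similarity across different data modalities remains challenging. Most existing methods project heterogeneous data into a common subspace through mapping matrices or deep models in order to mine pairwise correlations, that is, the global correspondence between an image and a text. They ignore the local context within each item and the fine-grained interactions between items, so they cannot fully mine cross-modal correlations. To address this, the paper proposes a text-image co-attention network model (CoAN), which strengthens the measurement of content similarity by selectively attending to the key information in multimodal data. CoAN uses a pre-trained VGGNet model and a recurrent neural network to extract fine-grained features of images and text, and uses a text-visual attention mechanism to capture the subtle interactions between language and vision. The model also learns separate hash representations for text and images, exploiting the low storage cost and computational efficiency of hashing to improve retrieval speed. Experiments on two widely used cross-modal datasets show that CoAN's mean average precision (mAP) exceeds that of all compared methods, reaching 0.807 for text-to-image retrieval and 0.769 for image-to-text retrieval. The results indicate that CoAN helps detect the key information regions of multimodal data and the fine-grained interactions between modalities, fully mines the content similarity of cross-modal data, and improves retrieval accuracy.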
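The abstract relies on two ideas that a small example may make more concrete: retrieval over binary hash codes, and the mAP metric used to report results. The following is a minimal sketch, not the authors' implementation. It assumes that hash codes are obtained by taking the sign of a real-valued feature vector, that items are ranked by Hamming distance, and that a retrieved item counts as relevant when it shares at least one label with the query. All function names, vectors, and labels are hypothetical.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> List[int]:
    """Turn a real-valued vector into a +1/-1 code by taking the sign (assumed scheme)."""
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def rank_by_hamming(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision of one ranked list; relevance[i] marks a hit at rank i + 1."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels) -> float:
    """mAP over all queries; relevance means sharing at least one label (assumed)."""
    scores = []
    for code, labels in zip(query_codes, query_labels):
        order = rank_by_hamming(code, db_codes)
        relevance = [bool(set(labels) & set(db_labels[i])) for i in order]
        scores.append(average_precision(relevance))
    return sum(scores) / len(scores) if scores else 0.0


# Toy text-to-image example with made-up 4-bit features and labels.
text_codes = [binarize([0.9, -0.2, 0.4, -0.7])]
text_labels = [["dog"]]
image_codes = [binarize(v) for v in ([0.8, -0.1, 0.3, -0.5],
                                     [-0.6, 0.7, -0.2, 0.9],
                                     [0.5, -0.3, -0.4, -0.1])]
image_labels = [["dog"], ["dog"], ["car"]]

print(rank_by_hamming(text_codes[0], image_codes))   # [0, 2, 1]
print(round(mean_average_precision(text_codes, text_labels,
                                   image_codes, image_labels), 4))  # 0.8333
```

In this toy run the relevant images land at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. Comparing short binary codes by Hamming distance is what gives hashing methods the low storage and fast lookup the abstract refers to.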