Computer Science ›› 2020, Vol. 47 ›› Issue (4): 54-59. doi: 10.11896/jsjkx.190600181

• Database & Big Data & Data Science •

Collaborative Attention Network Model for Cross-modal Retrieval

DENG Yi-jiao, ZHANG Feng-li, CHEN Xue-qin, AI Qing, YU Su-zhe   

  1. School of Information and Software Engineering,University of Electronic Science and Technology of China,Chengdu 610054,China
  • Received:2019-06-28 Online:2020-04-15 Published:2020-04-15
  • Contact: ZHANG Feng-li (fzhang@uestc.edu.cn),born in 1963,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.Her main research interests include network security and network engineering,cloud computing and big data,and machine learning.
  • About author: DENG Yi-jiao,born in 1995,postgraduate,is a member of China Computer Federation.Her main research interests include machine learning and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61272527) and the Science and Technology Program of Sichuan Province,China (2016GZ0063).


Abstract: With the rapid growth of multi-modal network data such as images,text,audio and video,the demand for diversified retrieval is growing,and cross-modal retrieval has attracted wide attention.However,because of the heterogeneity gap between modalities,measuring content similarity across heterogeneous data remains challenging.Most existing methods project heterogeneous data into a common subspace through a mapping matrix or a deep model and mine pairwise correlations,i.e. the global correspondence between an image and a text.They ignore the local context within each modality and the fine-grained interactions between modalities,so the cross-modal correlation cannot be fully exploited.To address this,a text-image collaborative attention network model (CoAN) is proposed,which strengthens the measurement of content similarity by selectively attending to the key information in multi-modal data.CoAN extracts fine-grained image and text features with a pre-trained VGGNet model and an LSTM network,and captures the subtle interactions between text and image through a text-image attention mechanism.Meanwhile,the model learns hash representations for text and images respectively,exploiting the low storage cost and computational efficiency of hashing to speed up retrieval.Experiments on two widely used cross-modal datasets show that the mean average precision (mAP) of CoAN exceeds that of all compared methods,reaching 0.807 for text-to-image retrieval and 0.769 for image-to-text retrieval.The results indicate that CoAN helps to detect the key information regions of multi-modal data and the fine-grained interactions between modalities,fully mines the content similarity of cross-modal data,and improves retrieval accuracy.
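
The paper's implementation is not reproduced on this page. As a rough illustration of the mechanism the abstract describes (fine-grained region and word features, a text-image co-attention step, and a hashing head), the following NumPy sketch may help; all dimensions, parameter names and the single-step attention form are assumptions made here for illustration, not the authors' CoAN code.

# Illustrative sketch only: fine-grained features, text-image co-attention and
# hash codes. All shapes, weights and the attention form are assumptions made
# for this page, not the authors' CoAN implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for fine-grained features, e.g. VGGNet region features and
# LSTM word features (49 regions, 20 words, 512-d each are assumed sizes).
n_regions, n_words, d = 49, 20, 512
V = rng.standard_normal((n_regions, d))   # image region features
T = rng.standard_normal((n_words, d))     # text word features

# Region-word affinity matrix through an assumed shared projection.
W = 0.01 * rng.standard_normal((d, d))
A = V @ W @ T.T                           # shape (n_regions, n_words)

# Co-attention: each modality is weighted by its relevance to the other one.
attn_regions = softmax(A.max(axis=1))     # importance of each image region
attn_words = softmax(A.max(axis=0))       # importance of each word

v_att = attn_regions @ V                  # attended image vector, shape (d,)
t_att = attn_words @ T                    # attended text vector, shape (d,)

# Hashing head: project to k bits and binarize, a common deep-hashing layer.
k = 64                                    # assumed code length
Wv = 0.01 * rng.standard_normal((d, k))
Wt = 0.01 * rng.standard_normal((d, k))
b_img = np.sign(v_att @ Wv)               # image hash code in {-1, +1}^k
b_txt = np.sign(t_att @ Wt)               # text hash code in {-1, +1}^k

# Cross-modal ranking is then driven by Hamming distance between codes.
print("Hamming distance:", int((b_img != b_txt).sum()))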

Key words: Collaborative attention mechanism, Cross-modal retrieval, Deep hash, Fine-grained feature extraction, Multi-modal data
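
For context, the mAP figures quoted in the abstract (0.807 for text-to-image and 0.769 for image-to-text retrieval) are of the kind usually computed by ranking the retrieval database by Hamming distance to each query's hash code. The sketch below shows only that metric; the binary codes and labels are random placeholders, not data or results from the paper.

# Mean average precision over a Hamming-distance ranking; the codes and labels
# below are random placeholders, not data or results from the paper.
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """query_codes/db_codes: {-1,+1} matrices; labels: integer class ids."""
    aps = []
    for q, ql in zip(query_codes, query_labels):
        dist = (db_codes != q).sum(axis=1)        # Hamming distance to each item
        order = np.argsort(dist, kind="stable")   # rank database by distance
        relevant = db_labels[order] == ql         # relevance of each ranked item
        if relevant.sum() == 0:
            continue
        hits = np.cumsum(relevant)
        precision_at_k = hits / (np.arange(len(relevant)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))

rng = np.random.default_rng(0)
k, n_query, n_db = 64, 10, 200                    # assumed sizes
query_codes = rng.choice([-1, 1], size=(n_query, k))
db_codes = rng.choice([-1, 1], size=(n_db, k))
query_labels = rng.integers(0, 5, size=n_query)
db_labels = rng.integers(0, 5, size=n_db)
print("mAP:", mean_average_precision(query_codes, db_codes, query_labels, db_labels))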

CLC Number: TP391
[1]OU W H,LIU B,ZHOU Y H,et al.Research review of cross-modal retrieval [J].Journal of Guizhou Normal University:Natural Science Edition,2018,36(2):114-120.
[2]FAN H,CHEN H H.Research progress of cross-modal retrieval based on hash method [J].Data Communication,2018,184(3):43-49.
[3]KUMAR S,UDUPA R.Learning Hash Functions for Cross-View Similarity Search[C]//Proceedings International Joint Conference on Artificial Intelligence.2011:1360-1365.
[4]WEISS Y,TORRALBA A,FERGUS R.Spectral hashing[C]//International Conference on Neural Information Processing Systems.2008.
[5]DING G,GUO Y,ZHOU J.Collective Matrix Factorization Hashing for Multimodal Data[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).IEEE,2014.
[6]ZHANG D,LI W J.Large-scale supervised multimodal hashing with semantic correlation maximization[C]//Twenty-eighth AAAI Conference on Artificial Intelligence.AAAI Press,2014.
[7]LIN Z,DING G,HU M,et al.Semantics-preserving hashing for cross-view retrieval[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).IEEE,2015.
[8]JIANG Q Y,LI W J.Deep Cross-Modal Hashing[C]//IEEE Conference on Computer Vision & Pattern Recognition.IEEE,2017.
[9]YANG E,DENG C,LIU W,et al.Pairwise Relationship Guided Deep Hashing for Cross-Modal Retrieval[C]//Thirty-First AAAI Conference on Artificial Intelligence.AAAI,2017.
[10]MNIH V,HEESS N,GRAVES A,et al.Recurrent Models of Visual Attention[J].arXiv:1406.6247,2014.
[11]STOLLENGA M,MASCI J,GOMEZ F,et al.Deep Networks with Internal Selective Attention through Feedback Connections[J].Advances in Neural Information Processing Systems,2014,4(2):3545-3553.
[12]GREGOR K,DANIHELKA I,GRAVES A,et al.DRAW:A Recurrent Neural Network For Image Generation[J].arXiv:1502.04623,2015.
[13]XU K,BA J,KIROS R,et al.Show,Attend and Tell:Neural Image Caption Generation with Visual Attention[J].arXiv:1502.03044,2015.
[14]YANG Z,HE X,GAO J,et al.Stacked Attention Networks for Image Question Answering[J].arXiv:1511.02274,2015.
[15]SHIH K J,SINGH S,HOIEM D.Where To Look:Focus Regions for Visual Question Answering[J].arXiv:1511.07394,2015.
[16]BAHDANAU D,CHO K,BENGIO Y.Neural Machine Translation by Jointly Learning to Align and Translate[J].arXiv:1409.0473,2014.
[17]LI J W,LUONG M T,JURAFSKY D.A hierarchical neural autoencoder for paragraphs and documents[J].arXiv:1506.01057,2015.
[18]RUSH A M,CHOPRA S,WESTON J.A Neural Attention Model for Abstractive Sentence Summarization[J].arXiv:1509.00685,2015.
[19]KUMAR A,IRSOY O,SU J,et al.Ask Me Anything:Dynamic Memory Networks for Natural Language Processing[J].arXiv:1506.07285,2015.
[20]XIONG C,MERITY S,SOCHER R.Dynamic Memory Networks for Visual and Textual Question Answering[J].arXiv:1603.01417,2016.
[21]HUANG Y,WANG W,WANG L.Instance-aware Image and Sentence Matching with Selective Multimodal LSTM[J].arXiv:1611.05588,2016.
[22]NAM H,HA J W,KIM J.Dual Attention Networks for Multimodal Reasoning and Matching[J].arXiv:1611.00471,2016.
[23]ZHANG X,LAI H,FENG J.Attention-Aware Deep Adversarial Hashing for Cross-Modal Retrieval[M]//Computer Vision-ECCV 2018.Cham:Springer,2018.
[24]LIU J W,DING X H,LUO X L.Review of multimodal deep learning [J].Computer Application Research,2019,37(6).
[25]RUSSAKOVSKY O,DENG J,SU H,et al.ImageNet Large Scale Visual Recognition Challenge[J].International Journal of Computer Vision,2015,115(3):211-252.
[26]SIMONYAN K,ZISSERMAN A.Very Deep Convolutional Networks for Large-Scale Image Recognition[J].arXiv:1409.1556,2014.
[27]LAI H,PAN Y,LIU Y,et al.Simultaneous feature learning and hash coding with deep neural networks[J].arXiv:1504.03410,2015.
[28]HUISKES M J,THOMEE B,LEW M S.New trends and ideas in visual concept detection:the MIR Flickr retrieval evaluation initiative[C]//International Conference on Multimedia Information Retrieval.ACM,2010.
[29]CHUA T S,TANG J,HONG R,et al.NUS-WIDE:a real-world web image database from National University of Singapore[C]//International Conference on Multimedia Information Retrieval.ACM,2009.
[30]RASIWASIA N,PEREIRA J C,COVIELLO E,et al.A New Approach to Cross-Modal Multimedia Retrieval[C]//International Conference on Multimedia.ACM,2010.