Computer Science ›› 2020, Vol. 47 ›› Issue (4): 54-59. doi:10.11896/jsjkx.190600181

• Database & Big Data & Data Science •

Collaborative Attention Network Model for Cross-modal Retrieval

DENG Yi-jiao, ZHANG Feng-li, CHEN Xue-qin, AI Qing, YU Su-zhe   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Received:2019-06-28 Online:2020-04-15 Published:2020-04-15
  • Contact: ZHANG Feng-li, born in 1963, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. Her main research interests include network security and network engineering, cloud computing, big data, and machine learning.
  • About author:DENG Yi-jiao,born in 1995,postgradua-te,is a member of China Computer Federation.Her main research interests include machine learning and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61272527) and the Science and Technology Program of Sichuan Province, China (2016GZ0063).

Abstract: With the rapid growth of multi-modal network data such as images, text, audio, and video, the demand for diversified retrieval is increasingly strong, and cross-modal retrieval has attracted wide attention. However, because different modalities are heterogeneous, measuring the content similarity of heterogeneous data remains challenging. Most existing methods project heterogeneous data into a common subspace through a mapping matrix or a deep model, mining pairwise correlations to obtain a global correspondence between image and text. These methods, however, ignore local context information and the fine-grained interactions between the data, so cross-modal correlation cannot be fully mined. This paper therefore proposes a text-image collaborative attention network model (CoAN), which selectively attends to the key parts of multi-modal data in order to enhance the measurement of content similarity. A pre-trained VGGNet model and an LSTM model are used to extract fine-grained image and text features, and CoAN captures the subtle interactions between text and image through a text-image attention mechanism. The model also learns hash representations of text and image, exploiting the low storage cost and high efficiency of hashing to improve retrieval speed. Experiments on two widely used cross-modal datasets show that the mean average precision (mAP) of CoAN is higher than that of all compared methods, reaching 0.807 for text-to-image retrieval and 0.769 for image-to-text retrieval. These results indicate that CoAN helps detect the key information and fine-grained interaction information of multi-modal data, and that retrieval accuracy improves when the content similarity of cross-modal data is fully mined.
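The text-image attention idea described in the abstract can be illustrated with a rough sketch. This is not the paper's actual CoAN architecture; the function name, the use of a word-region affinity matrix, and the max-pooling over the affinity are all simplifying assumptions made for illustration. It only shows the general pattern of each modality attending over the other's fine-grained features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(img_regions, text_words):
    """Toy text-image co-attention (illustrative, not the paper's model).

    img_regions: (R, d) fine-grained region features (e.g. from a VGGNet)
    text_words:  (W, d) word features (e.g. from an LSTM)
    Returns one attended d-dimensional vector per modality.
    """
    # Affinity between every word and every image region
    affinity = text_words @ img_regions.T              # (W, R)
    # Each modality attends over the other, weighted by affinity
    attn_over_regions = softmax(affinity.max(axis=0))  # (R,)
    attn_over_words = softmax(affinity.max(axis=1))    # (W,)
    img_vec = attn_over_regions @ img_regions          # (d,)
    txt_vec = attn_over_words @ text_words             # (d,)
    return img_vec, txt_vec
```

In a trained model the affinity would typically be computed through learned projections rather than a raw dot product, but the attend-and-pool structure is the same.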

Key words: Cross-modal retrieval, Collaborative attention mechanism, Fine-grained feature extraction, Deep hash, Multi-modal data
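The abstract's hashing and mAP evaluation steps can also be sketched. The projection-then-sign binarization and the label-based relevance rule below are common conventions in cross-modal hashing evaluation, assumed here for illustration rather than taken from the paper:

```python
import numpy as np

def to_hash(features, proj):
    """Binarize real-valued features into hash codes via sign of a projection."""
    return (features @ proj > 0).astype(np.uint8)     # (n, bits)

def hamming(code, codes):
    """Hamming distance from one code to a matrix of codes."""
    return (code != codes).sum(axis=1)

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP over Hamming ranking: items sharing the query's label are relevant."""
    aps = []
    for q, ql in zip(query_codes, query_labels):
        order = np.argsort(hamming(q, db_codes), kind="stable")
        rel = db_labels[order] == ql
        if rel.sum() == 0:
            continue
        # Precision at each rank, averaged over the relevant positions
        prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append((prec * rel).sum() / rel.sum())
    return float(np.mean(aps))
```

Because codes are binary, the Hamming ranking step needs only bitwise comparisons, which is the low-storage, high-efficiency property of hashing that the abstract refers to.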

CLC Number: TP391