Computer Science ›› 2020, Vol. 47 ›› Issue (4): 54-59. doi: 10.11896/jsjkx.190600181

• Database & Big Data & Data Science •

Collaborative Attention Network Model for Cross-modal Retrieval

DENG Yi-jiao, ZHANG Feng-li, CHEN Xue-qin, AI Qing, YU Su-zhe   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Received:2019-06-28 Online:2020-04-15 Published:2020-04-15
  • Contact: ZHANG Feng-li, born in 1963, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. Her main research interests include network security and network engineering, cloud computing and big data, and machine learning.
  • About author: DENG Yi-jiao, born in 1995, postgraduate, is a member of China Computer Federation. Her main research interests include machine learning and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61272527) and the Science and Technology Program of Sichuan Province, China (2016GZ0063).

Abstract: With the rapid growth of multi-modal network data such as images, text, sound, and video, the demand for diversified retrieval is growing, and cross-modal retrieval has attracted wide attention. However, data of different modalities are heterogeneous, and measuring the content similarity of heterogeneous data remains challenging. Most existing methods project heterogeneous data into a common subspace through a mapping matrix or a deep model. In this way, pairwise correlations are mined and the global correspondence between image and text is obtained. However, these methods ignore local context information and the fine-grained interaction between the data, so the cross-modal correlation cannot be fully mined. Therefore, a text-image collaborative attention network model (CoAN) is proposed. To improve the measurement of content similarity, the model selectively focuses on the key parts of multi-modal data. Pre-trained VGGNet and LSTM models extract fine-grained features of images and text, and a text-image attention mechanism captures the subtle interactions between them. At the same time, the model learns hash representations of text and images separately, and the low storage cost and high efficiency of hashing improve retrieval speed. Experiments on two widely used cross-modal datasets show that the mean Average Precision (mAP) of CoAN is higher than that of all compared methods, reaching 0.807 and 0.769 for text-to-image and image-to-text retrieval, respectively. The experimental results show that CoAN helps detect key information and fine-grained interactions in multi-modal data, and that fully mining the content similarity of cross-modal data improves retrieval accuracy.
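
The abstract reports results as mAP over hash-based retrieval. As a rough illustration of how such a score is typically computed, the following is a minimal sketch, not the authors' implementation. It assumes binary codes stored as {-1, +1} vectors, ranks database items by Hamming distance, and averages the standard Average Precision over queries. All function names and the toy data are hypothetical, and the abstract does not state the paper's exact evaluation protocol (for example, any top-k cutoff).

```python
import numpy as np


def hamming_distances(query_code, db_codes):
    """Hamming distance between one {-1,+1} code and each row of db_codes."""
    n_bits = db_codes.shape[1]
    # For +/-1 codes: distance = (n_bits - inner product) / 2.
    return (n_bits - db_codes @ query_code) // 2


def average_precision(relevant_sorted):
    """AP for one query, given 0/1 relevance flags in ranked order."""
    relevant_sorted = np.asarray(relevant_sorted, dtype=float)
    n_rel = relevant_sorted.sum()
    if n_rel == 0:
        return 0.0  # convention assumed here: no relevant items gives AP 0
    ranks = np.arange(1, len(relevant_sorted) + 1)
    precision_at_k = np.cumsum(relevant_sorted) / ranks
    # Average the precision values at the positions of relevant items.
    return float((precision_at_k * relevant_sorted).sum() / n_rel)


def mean_average_precision(query_codes, db_codes, relevance):
    """relevance[i, j] = 1 if database item j is relevant to query i."""
    aps = []
    for q, rel in zip(query_codes, relevance):
        # Stable sort keeps database order among equal distances.
        order = np.argsort(hamming_distances(q, db_codes), kind="stable")
        aps.append(average_precision(rel[order]))
    return float(np.mean(aps))


# Toy example (hypothetical data): one text query against three image codes.
db = np.array([[1, 1, 1, 1], [1, 1, -1, -1], [-1, -1, -1, -1]])
query = np.array([1, 1, 1, -1])
labels = np.array([[1, 0, 1]])
score = mean_average_precision(np.array([query]), db, labels)
```

In this toy case the relevant items land at ranks 1 and 3, so the score is (1 + 2/3) / 2. Comparing codes by Hamming distance rather than by real-valued feature distance is what gives hashing methods the low storage cost and fast lookup mentioned in the abstract.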

Key words: Collaborative attention mechanism, Cross-modal retrieval, Deep hash, Fine-grained feature extraction, Multi-modal data

CLC Number: TP391