Computer Science ›› 2019, Vol. 46 ›› Issue (4): 268-273. doi: 10.11896/j.issn.1002-137X.2019.04.042

• Graphics, Image & Pattern Recognition •

Image Description Model Fusing Word2vec and Attention Mechanism

DENG Zhen-rong1,2, ZHANG Bao-jun1, JIANG Zhou-qin1, HUANG Wen-ming1,2   

  1. School of Computer and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  2. Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems, Guilin, Guangxi 541004, China
  • Received: 2018-06-03  Online: 2019-04-15  Published: 2019-04-23

Abstract: Aiming at the problem that the overall quality of the descriptive sentences generated in current image description tasks is not high, an image description model fusing word2vec and an attention mechanism is proposed. In the encoding stage, the word2vec model is used to vectorize the description text, strengthening the relationships among words. The VGGNet19 network is used to extract image features, and an attention mechanism is integrated over these features, so that the relevant image regions are highlighted when a word is generated at each time step. In the decoding stage, a GRU network is used as the language generation model for the image description task, improving training efficiency and the quality of the generated sentences. Experimental results on the Flickr8k and Flickr30k datasets show that, under the same training environment, the GRU model saves about one third of the training time compared with the LSTM model, and the proposed model achieves significant improvements under the BLEU and METEOR evaluation metrics.
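To make the architecture described in the abstract concrete, the following is a minimal, illustrative sketch (in PyTorch, not the authors' released code) of an encoder-decoder captioner that combines pre-trained word2vec embeddings, VGG19 convolutional features, additive attention over image regions, and a GRU decoder. All layer sizes, tensor shapes, and names are assumptions chosen for readability.

    # Illustrative sketch only; layer sizes and names are assumptions, not the paper's exact configuration.
    import torch
    import torch.nn as nn

    class AttentionGRUCaptioner(nn.Module):
        def __init__(self, vocab_size, embed_dim=300, feat_dim=512,
                     hidden_dim=512, attn_dim=256, word2vec_weights=None):
            super().__init__()
            # Word embeddings; optionally initialised from pre-trained word2vec vectors.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            if word2vec_weights is not None:   # tensor of shape (vocab_size, embed_dim)
                self.embed.weight.data.copy_(word2vec_weights)
            # Additive (Bahdanau-style) attention over the 14x14 = 196 VGG19 regions.
            self.attn_feat = nn.Linear(feat_dim, attn_dim)
            self.attn_hidden = nn.Linear(hidden_dim, attn_dim)
            self.attn_score = nn.Linear(attn_dim, 1)
            # GRU decoder; its input is [word embedding ; attended image feature].
            self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def attend(self, feats, h):
            # feats: (B, 196, feat_dim), h: (B, hidden_dim)
            scores = self.attn_score(torch.tanh(
                self.attn_feat(feats) + self.attn_hidden(h).unsqueeze(1)))  # (B, 196, 1)
            alpha = torch.softmax(scores, dim=1)          # attention weights per region
            return (alpha * feats).sum(dim=1)             # (B, feat_dim)

        def forward(self, feats, captions):
            # feats: VGG19 conv features reshaped to (B, 196, 512); captions: (B, T) token ids.
            # Teacher forcing is used for simplicity.
            B, T = captions.shape
            h = feats.new_zeros(B, self.gru.hidden_size)
            emb = self.embed(captions)                     # (B, T, embed_dim)
            logits = []
            for t in range(T):
                context = self.attend(feats, h)            # highlight relevant regions per time step
                h = self.gru(torch.cat([emb[:, t], context], dim=1), h)
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)              # (B, T, vocab_size)

    # Usage with random tensors standing in for real VGG19 features and token ids.
    model = AttentionGRUCaptioner(vocab_size=10000)
    fake_feats = torch.randn(2, 196, 512)
    fake_caps = torch.randint(0, 10000, (2, 12))
    print(model(fake_feats, fake_caps).shape)              # torch.Size([2, 12, 10000])

The GRU cell has one gate fewer than an LSTM cell, which is the usual reason it trains faster; swapping nn.GRUCell for nn.LSTMCell in this sketch would give the kind of LSTM baseline the abstract compares against.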

Key words: Attention mechanism, GRU model, Image description, word2vec

CLC Number: 

  • TP391.41