Computer Science ›› 2021, Vol. 48 ›› Issue (4): 157-163. doi: 10.11896/jsjkx.200300146

• Computer Graphics & Multimedia •

Generation of Image Caption of Joint Self-attention and Recurrent Neural Network

WANG Xi1, ZHANG Kai1, LI Jun-hui1, KONG Fang1, ZHANG Yi-tian2   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  2 China Industrial Control Systems Cyber Emergency Response Team, Beijing 100000, China
  • Received: 2020-06-24 Revised: 2020-07-12 Online: 2021-04-15 Published: 2021-04-09
  • About author: WANG Xi, born in 1995, postgraduate, is a member of the China Computer Federation. Her main research interests include natural language processing and image captioning.
    LI Jun-hui, born in 1983, associate professor. His main research interests include natural language processing and machine translation.
  • Supported by:
    National Natural Science Foundation of China (61876120).

Abstract: At present, most image caption generation models consist of an image encoder based on a convolutional neural network (CNN) and a caption decoder based on a recurrent neural network (RNN). The image encoder extracts visual features from the image, while the caption decoder uses an attention mechanism over those features to generate the caption. Although an RNN decoder with attention models the interaction between image features and the caption, it ignores the interactions within the image and within the caption, which self-attention can capture. This paper therefore proposes a novel image caption generation model that combines the advantages of an RNN and a self-attention network. On the one hand, self-attention lets the model capture interactions within and between modalities simultaneously in a unified attention area. On the other hand, the model retains the inherent advantages of the RNN. Experimental results on the MSCOCO dataset show that the proposed model outperforms the baseline, improving the CIDEr score from 1.135 to 1.166.
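
To illustrate what a "unified attention area" might look like, the following minimal sketch applies scaled dot-product self-attention to a single sequence that concatenates image-region features and caption-word embeddings, so every position can attend both within its own modality and across modalities. It is not the authors' implementation: the toy vectors, the function names, and the use of identity query/key/value projections are assumptions made for illustration, and the learned projections, the RNN component, and the causal masking a real decoder needs are all omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    seq: list of d-dimensional vectors (lists of floats).
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d = len(seq[0])
    outputs, weights = [], []
    for q in seq:
        # Similarity of this position to every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        # Output is the attention-weighted average of all positions.
        out = [sum(w[j] * seq[j][t] for j in range(len(seq))) for t in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Hypothetical toy inputs: 3 image-region features and 2 caption-word embeddings.
image_regions = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.5], [0.0, 0.0, 1.0, 0.5]]
caption_words = [[0.9, 0.1, 0.0, 0.2], [0.0, 0.2, 0.8, 0.1]]

# Unified attention area: one sequence holding both modalities.
joint = image_regions + caption_words
outputs, weights = self_attention(joint)

# For the first caption word, split its attention mass by modality.
n_img = len(image_regions)
word_row = weights[n_img]
to_image = sum(word_row[:n_img])   # cross-modal attention (word -> image)
to_words = sum(word_row[n_img:])   # intra-modal attention (word -> words)
print(f"word 0 -> image regions: {to_image:.3f}, -> caption words: {to_words:.3f}")
```

Because both modalities share one attention matrix, the same row of weights covers intra-modal and inter-modal interactions at once, which is the property the abstract attributes to self-attention.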

Key words: Image caption, Recurrent neural network, Self-attention mechanism

CLC Number: 

  • TP391.1