计算机科学 (Computer Science), 2021, Vol. 48, Issue (4): 157-163. doi: 10.11896/jsjkx.200300146

• Computer Graphics & Multimedia •

  • Corresponding author: LI Jun-hui (jhli@suda.edu.cn)

Image Caption Generation with Joint Self-attention and Recurrent Neural Network

WANG Xi1, ZHANG Kai1, LI Jun-hui1, KONG Fang1, ZHANG Yi-tian2   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2. China Industrial Control Systems Cyber Emergency Response Team, Beijing 100000, China
  • Received: 2020-06-24  Revised: 2020-07-12  Online: 2021-04-15  Published: 2021-04-09
  • About author: WANG Xi, born in 1995, postgraduate, is a member of the China Computer Federation. Her main research interests include natural language processing and image captioning. (20185427010@stu.suda.edu.cn)
    LI Jun-hui, born in 1983, associate professor. His main research interests include natural language processing and machine translation.
  • Supported by:
    National Natural Science Foundation of China (61876120).


Abstract: At present, most image caption generation models consist of an image encoder based on a convolutional neural network (CNN) and a caption decoder based on a recurrent neural network (RNN). The image encoder extracts visual features from the image, while the caption decoder generates the caption from those features through an attention mechanism. Although such a decoder models the interaction between image features and caption words, it ignores the self-attention over interactions within the image or within the caption itself. Therefore, this paper proposes a novel model for image caption generation that combines the advantages of an RNN and a self-attention network. On the one hand, the model captures intra-modal and inter-modal interactions simultaneously within a unified attention area through self-attention; on the other hand, it retains the inherent advantages of the RNN. Experimental results on the MSCOCO dataset show that the proposed model outperforms the baseline, improving CIDEr from 1.135 to 1.166.
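The "unified attention area" described in the abstract — a single self-attention map spanning both image regions and caption tokens, so intra-modal (region-region, word-word) and inter-modal (word-region) interactions are scored together — can be sketched roughly as follows. This is an illustrative single-head NumPy sketch under assumed shapes; it omits the learned query/key/value projections, multi-head splitting, and the RNN decoder of the actual model, and the function name `unified_self_attention` is hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_self_attention(image_feats, caption_feats):
    # Concatenate image-region features and caption-token features so one
    # attention map covers intra-modal (region-region, word-word) and
    # inter-modal (word-region) interactions at the same time.
    x = np.concatenate([image_feats, caption_feats], axis=0)  # (R+T, d)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)       # scaled dot-product logits
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ x                  # attended features, shape (R+T, d)

# Toy example: 3 image regions and 4 caption tokens, feature dimension 8.
rng = np.random.default_rng(0)
regions = rng.standard_normal((3, 8))
tokens = rng.standard_normal((4, 8))
attended = unified_self_attention(regions, tokens)
print(attended.shape)  # (7, 8)
```

In the full model, the rows of `attended` corresponding to caption positions would feed the RNN decoder at each step, so the recurrent state and the unified attention complement each other rather than replace one another.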

Key words: Image caption, Recurrent neural network, Self-attention mechanism

CLC Number: TP391.1