Computer Science ›› 2021, Vol. 48 ›› Issue (4): 157-163. doi: 10.11896/jsjkx.200300146
WANG Xi1, ZHANG Kai1, LI Jun-hui1, KONG Fang1, ZHANG Yi-tian2
Abstract: Most current image caption generation models consist of an image encoder based on a Convolutional Neural Network (CNN) and a caption decoder based on a Recurrent Neural Network (RNN). The image encoder extracts visual features from the image, and the caption decoder generates the caption from those features through an attention mechanism. However, attention-based RNN decoders have a shortcoming: although the decoder attends over the interaction between image features and the caption, it ignores the self-attention that models interactions within the caption itself. To address this, this paper proposes an image caption generation model that combines the strengths of recurrent networks and self-attention networks. On one hand, the model uses self-attention to capture intra-modal and inter-modal interactions simultaneously within a unified attention region; on the other hand, it retains the inherent advantages of recurrent networks. Experimental results on the MSCOCO dataset show that the CIDEr score improves from 1.135 to 1.166, demonstrating that the proposed method effectively improves image caption generation performance.
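The "unified attention region" idea described above — applying self-attention over the concatenation of image-region features and caption-token states, so that a single attention pass captures both intra-modal (region-region, word-word) and inter-modal (word-region) interactions — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, shapes, and single-head formulation are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_self_attention(image_feats, token_states):
    """Single-head self-attention over one unified attention region.

    image_feats:  (m, d) image-region features from the encoder
    token_states: (n, d) caption-token states from the decoder
    returns:      (n, d) attended token representations

    Because image regions and caption tokens share one attention
    region, each token attends to other tokens (intra-modal) and to
    image regions (inter-modal) in the same pass.
    """
    x = np.concatenate([image_feats, token_states], axis=0)  # (m+n, d)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # (m+n, m+n) attention logits
    attn = softmax(scores, axis=-1)      # each row sums to 1
    out = attn @ x                       # attended representations
    return out[image_feats.shape[0]:]    # keep only the token positions
```

In the full model, the attended token representations would then feed the recurrent decoder step that predicts the next caption word.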