Computer Science, 2022, Vol. 49, Issue (10): 151-158. doi: 10.11896/jsjkx.210900159

• Computer Graphics & Multimedia •

  • Corresponding author: ZHANG Jing (jingzhang@ecust.edu.cn)
  • First author's email: y30190781@mail.ecust.edu.cn
Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

FANG Zhong-jun1,2, ZHANG Jing1, LI Dong-dong1,2   

  1 School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215031, China
  • Received:2021-09-22 Revised:2022-03-09 Online:2022-10-15 Published:2022-10-13
  • About author: FANG Zhong-jun, born in 1997, postgraduate, is a member of China Computer Federation. His main research interests include computer vision, neural networks and image captioning.
    ZHANG Jing, born in 1978, Ph.D, associate professor, is a member of China Computer Federation. Her main research interests include computer vision, image/video information retrieval, image annotation and so on.
  • Supported by:
    National Natural Science Foundation of China(61806078).


Abstract: Image captioning is one of the hot research topics in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: it describes an image by understanding its content and generating captions that are both semantically and grammatically correct. Most existing image captioning methods adopt the encoder-decoder model, and most of them ignore the relative position relationships between visual objects when extracting visual object features from an image, although these relationships are very important for generating accurate captions. Based on this, this paper proposes a spatial encoding and multi-layer joint encoding enhanced Transformer for image captioning. To make better use of the position information contained in the image, a spatial encoding mechanism for visual objects is proposed, which converts the independent spatial information of each visual object into relative spatial relationships between visual objects, helping the model recognize the relative position of each visual object. Meanwhile, in the encoder for visual objects, the top encoding layer retains more semantic information that fits the image but loses part of the image's visual information. Taking this into account, this paper proposes a multi-layer joint encoding mechanism that refines the semantic information contained in the top encoding layer by integrating the image feature information contained in each shallow encoding layer, so as to obtain richer encoded features that fit the image semantics. The proposed method is evaluated on the MSCOCO dataset with multiple metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.). Ablation experiments prove that the proposed spatial encoding mechanism and multi-layer joint encoding mechanism help generate more accurate and effective image captions. Comparative experimental results show that the proposed method produces accurate and effective image captions and outperforms most state-of-the-art methods.
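The abstract does not give the exact formula of the spatial encoding mechanism; a common way to turn each object's independent box coordinates into pairwise relative spatial relations, as described above, is to compute log-scaled offsets and size ratios between every pair of detected boxes. The function name and the 1e-3 stabiliser below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def relative_spatial_features(boxes):
    """Pairwise relative geometry between detected visual objects.

    boxes: (N, 4) array of [x_min, y_min, x_max, y_max] per object.
    Returns an (N, N, 4) tensor of log-scaled relative offsets and
    size ratios, usable as a spatial bias for Transformer attention.
    """
    x = (boxes[:, 0] + boxes[:, 2]) / 2.0   # centre x of each box
    y = (boxes[:, 1] + boxes[:, 3]) / 2.0   # centre y of each box
    w = boxes[:, 2] - boxes[:, 0]           # box widths
    h = boxes[:, 3] - boxes[:, 1]           # box heights

    # Broadcast to all (i, j) pairs; normalise offsets by box i's size.
    dx = np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])    # relative width ratio
    dh = np.log(h[None, :] / h[:, None])    # relative height ratio
    return np.stack([dx, dy, dw, dh], axis=-1)

boxes = np.array([[10, 10, 50, 90], [60, 20, 120, 80]], dtype=float)
feats = relative_spatial_features(boxes)
print(feats.shape)  # (2, 2, 4)
```

In practice such a 4-d relative geometry vector is projected to a per-head attention bias; the sketch stops at the geometry itself, which is the part the abstract describes.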

Key words: Image captioning, Transformer, Spatial encoding mechanism, Multi-layer joint encoding mechanism, Attention mechanism
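The multi-layer joint encoding mechanism integrates the features of every shallow encoder layer into the top layer's output. The abstract does not specify the fusion rule; one minimal sketch, assuming a learned softmax-weighted sum over layer outputs (the function and gate names are hypothetical), is:

```python
import numpy as np

def multi_layer_joint_encoding(layer_outputs, gate_logits):
    """Fuse all encoder layers' outputs into one representation.

    layer_outputs: list of L arrays, each (N, d) - the encoded object
    features of one layer, shallow layers first, top layer last.
    gate_logits: length-L learnable scores; softmax-normalised so the
    fused feature is a convex combination of every layer's output.
    """
    logits = np.asarray(gate_logits, dtype=float)
    weights = np.exp(logits - logits.max())   # stable softmax
    weights /= weights.sum()
    return sum(w * out for w, out in zip(weights, layer_outputs))

# Three toy "encoder layers" over 5 objects with 8-d features.
layers = [np.full((5, 8), float(i)) for i in range(3)]
fused = multi_layer_joint_encoding(layers, [0.0, 0.0, 0.0])
print(fused.shape)  # (5, 8)
```

With equal gate logits the fusion reduces to a plain average of the layers; training the logits lets the model decide how much shallow visual detail to mix back into the top layer's semantic features.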

CLC number: TP183