计算机科学 ›› 2022, Vol. 49 ›› Issue (10): 151-158.doi: 10.11896/jsjkx.210900159
方仲俊1,2, 张静1, 李冬冬1,2
FANG Zhong-jun1,2, ZHANG Jing1, LI Dong-dong1,2
摘要: 图像描述是图像理解领域的热点研究课题之一,它是结合计算机视觉和自然语言处理的跨媒体数据分析任务,通过理解图像内容并生成语义和语法都正确的句子来描述图像。现有的图像描述方法多采用编码器-解码器模型,该类方法在提取图像中的视觉对象特征时大多忽略了视觉对象之间的相对位置关系,但它对于正确描述图像的内容是非常重要的。基于此,提出了基于Transformer的空间和多层级联合编码的图像描述方法。为了更好地利用图像中所包含的对象的位置信息,提出了视觉对象的空间编码机制,将各个视觉对象独立的空间关系转换为视觉对象间的相对空间关系,以此来帮助模型识别各个视觉对象间的相对位置关系。同时,在视觉对象的编码阶段,顶部的编码特征保留了更多的贴合图像语义信息,但丢失了图像部分视觉信息,考虑到这一点,文中提出了多层级联合编码机制,通过整合各个浅层的编码层所包含的图像特征信息来完善顶部编码层所蕴含的语义的信息,从而获取到更丰富的贴合图像的语义信息的编码特征。文中在MSCOCO数据集上使用多种评估指标(BLEU,METEOR,ROUGE-L和 CIDEr等)对提出的图像描述方法进行评估,并通过消融实验证明了提出的基于空间的编码机制以及多层级联合编码机制能够辅助产生更为准确有效的图像描述语句。对比实验结果表明,所提方法能够产生准确、有效的图像描述并优于大多数最新的算法。
中图分类号:
| [1]MITCHELL M,DODGE J,GOYAL A,et al.Midge:Generating image descriptions from computer vision detections[C]//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.2012:747-756. [2]LU J,YANG J,BATRA D,et al.Neural baby talk[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7219-7228. [3]DEVLIN J,CHENG H,FANG H,et al.Language models for image captioning:The quirks and what works[C]//Association for Computational Linguistics(ACL).2015:100-105. [4]WANG C,YANG H,BARTZ C,et al.Image captioning with deep bidirectional LSTMs[C]//Proceedings of the 24th ACM international conference on Multimedia.2016:988-997. [5]ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6077-6086. [6]LI G,ZHU L,LIU P,et al.Entangled transformer for image captioning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:8928-8937. [7]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014. [8]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778. [9]REN S,HE K,GIRSHICK R,et al.Faster R-CNN:towardsreal-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2016,39(6):1137-1149. [10]VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3156-3164. [11]XU K,BA J,KIROS R,et al.Show,attend and tell:Neuralimage caption generation with visual attention[C]//InternationalConference on Machine Learning.PMLR,2015:2048-2057. [12]LU J,XIONG C,PARIKH D,et al.Knowing when to look:Adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:375-383. [13]CHEN L,ZHANG H,XIAO J,et al.Sca-cnn:Spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:5659-5667. [14]GUO Y,LIU Y,DE BOER M H T,et al.A dual prediction network for image captioning[C]//2018 IEEE International Conference on Multimedia and Expo.IEEE,2018:1-6. [15]GAN Z,GAN C,HE X,et al.Semantic compositional networks for visual captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:5630-5639. [16]YAO T,PAN Y,LI Y,et al.Boosting image captioning with attributes[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:4894-4902. [17]FENG Y,LAN L,ZHANG X,et al.AttResNet:Attention-based ResNet for Image Captioning[C]//Proceedings of the 2018 International Conference on Algorithms,Computing and Artificial Intelligence.2018:1-6. [18]LI N,CHEN Z.Image Cationing with Visual-Semantic LSTM[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence.2018:793-799. [19]ZHANG J,LI K,WANG Z.Parallel-fusion LSTM with synchronous semantic and visual information for image captioning[J].Journal of Visual Communication and Image Representation,2021,75:103044. [20]ZHANG Z,WU Q,WANG Y,et al.Exploring region relationships implicitly:Image captioning with visual relationship attention[J].Image and Vision Computing,2021,109:104146. [21]PEI H,CHEN Q,WANG J,et al.Visual Relational Reasoning for Image Caption[C]//2020 International Joint Conference on Neural Networks.IEEE,2020:1-8. [22]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances inNeural Information Processing Systems.2017:5998-6008. [23]HERDADE S,KAPPELER A,BOAKYE K,et al.Image captioning:transforming objects into words[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems.2019:11137-11147. [24]WANG D,HU H,CHEN D.Transformer with sparse self-attention mechanism for image captioning[J].Electronics Letters,2020,56(15):764-766. [25]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:Common objects in context[C]//European Conference on Computer Vision.Cham:Springer,2014:740-755. [26]KARPATHY A,FEI-FEI L.Deep visual-semantic alignmentsfor generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3128-3137. [27]PAPINENI K,ROUKOS S,WARD T,et al.Bleu:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th annual meeting of the Association for Computational Linguistics.2002:311-318. [28]BANERJEE S,LAVIE A.METEOR:An automatic metric forMT evaluation with improved correlation with human judgments[C]//Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.2005:65-72. [29]LIN C Y.Rouge:A package for automatic evaluation of summaries[C]//TextSummarization Branches Out.2004:74-81. [30]VEDANTAM R,LAWRENCE ZITNICK C,PARIKH D.Ci-der:Consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:4566-4575. [31]KINGMA D P,BA J.Adam:A method for stochastic optimization[C]//Proceedings of the 3rd International Conference for Learning Representations.2015. | 
| [1] | 周芳泉, 成卫青. 基于全局增强图神经网络的序列推荐 Sequence Recommendation Based on Global Enhanced Graph Neural Network 计算机科学, 2022, 49(9): 55-63. https://doi.org/10.11896/jsjkx.210700085 | 
| [2] | 戴禹, 许林峰. 基于文本行匹配的跨图文本阅读方法 Cross-image Text Reading Method Based on Text Line Matching 计算机科学, 2022, 49(9): 139-145. https://doi.org/10.11896/jsjkx.220600032 | 
| [3] | 周乐员, 张剑华, 袁甜甜, 陈胜勇. 多层注意力机制融合的序列到序列中国连续手语识别和翻译 Sequence-to-Sequence Chinese Continuous Sign Language Recognition and Translation with Multi- layer Attention Mechanism Fusion 计算机科学, 2022, 49(9): 155-161. https://doi.org/10.11896/jsjkx.210800026 | 
| [4] | 熊丽琴, 曹雷, 赖俊, 陈希亮. 基于值分解的多智能体深度强化学习综述 Overview of Multi-agent Deep Reinforcement Learning Based on Value Factorization 计算机科学, 2022, 49(9): 172-182. https://doi.org/10.11896/jsjkx.210800112 | 
| [5] | 饶志双, 贾真, 张凡, 李天瑞. 基于Key-Value关联记忆网络的知识图谱问答方法 Key-Value Relational Memory Networks for Question Answering over Knowledge Graph 计算机科学, 2022, 49(9): 202-207. https://doi.org/10.11896/jsjkx.220300277 | 
| [6] | 姜梦函, 李邵梅, 郑洪浩, 张建朋. 基于改进位置编码的谣言检测模型 Rumor Detection Model Based on Improved Position Embedding 计算机科学, 2022, 49(8): 330-335. https://doi.org/10.11896/jsjkx.210600046 | 
| [7] | 汪鸣, 彭舰, 黄飞虎. 基于多时间尺度时空图网络的交通流量预测模型 Multi-time Scale Spatial-Temporal Graph Neural Network for Traffic Flow Prediction 计算机科学, 2022, 49(8): 40-48. https://doi.org/10.11896/jsjkx.220100188 | 
| [8] | 朱承璋, 黄嘉儿, 肖亚龙, 王晗, 邹北骥. 基于注意力机制的医学影像深度哈希检索算法 Deep Hash Retrieval Algorithm for Medical Images Based on Attention Mechanism 计算机科学, 2022, 49(8): 113-119. https://doi.org/10.11896/jsjkx.210700153 | 
| [9] | 孙奇, 吉根林, 张杰. 基于非局部注意力生成对抗网络的视频异常事件检测方法 Non-local Attention Based Generative Adversarial Network for Video Abnormal Event Detection 计算机科学, 2022, 49(8): 172-177. https://doi.org/10.11896/jsjkx.210600061 | 
| [10] | 闫佳丹, 贾彩燕. 基于双图神经网络信息融合的文本分类方法 Text Classification Method Based on Information Fusion of Dual-graph Neural Network 计算机科学, 2022, 49(8): 230-236. https://doi.org/10.11896/jsjkx.210600042 | 
| [11] | 金方焱, 王秀利. 融合RACNN和BiLSTM的金融领域事件隐式因果关系抽取 Implicit Causality Extraction of Financial Events Integrating RACNN and BiLSTM 计算机科学, 2022, 49(7): 179-186. https://doi.org/10.11896/jsjkx.210500190 | 
| [12] | 熊罗庚, 郑尚, 邹海涛, 于化龙, 高尚. 融合双向门控循环单元和注意力机制的软件自承认技术债识别方法 Software Self-admitted Technical Debt Identification with Bidirectional Gate Recurrent Unit and Attention Mechanism 计算机科学, 2022, 49(7): 212-219. https://doi.org/10.11896/jsjkx.210500075 | 
| [13] | 彭双, 伍江江, 陈浩, 杜春, 李军. 基于注意力神经网络的对地观测卫星星上自主任务规划方法 Satellite Onboard Observation Task Planning Based on Attention Neural Network 计算机科学, 2022, 49(7): 242-247. https://doi.org/10.11896/jsjkx.210500093 | 
| [14] | 张颖涛, 张杰, 张睿, 张文强. 全局信息引导的真实图像风格迁移 Photorealistic Style Transfer Guided by Global Information 计算机科学, 2022, 49(7): 100-105. https://doi.org/10.11896/jsjkx.210600036 | 
| [15] | 曾志贤, 曹建军, 翁年凤, 蒋国权, 徐滨. 基于注意力机制的细粒度语义关联视频-文本跨模态实体分辨 Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism 计算机科学, 2022, 49(7): 106-112. https://doi.org/10.11896/jsjkx.210500224 | 
| 
 | ||