Computer Science ›› 2024, Vol. 51 ›› Issue (7): 214-220. doi: 10.11896/jsjkx.230600167

• Computer Graphics & Multimedia •


Image Captioning Generation Method Based on External Prior and Self-prior Attention

LI Yongjie, QIAN Yi, WEN Yimin   

  1. Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China
  • Received: 2023-06-21  Revised: 2023-10-25  Online: 2024-07-15  Published: 2024-07-10
  • Corresponding author: WEN Yimin (ymwen@guet.edu.cn)
  • About author: LI Yongjie, born in 1998, postgraduate (jayli1998@foxmail.com). His main research interests include computer vision, neural networks and image captioning.
    WEN Yimin, born in 1969, Ph.D, professor, Ph.D supervisor, is a distinguished member of CCF (No.06757D). His main research interests include machine learning, computer vision and big data analytics.
  • Supported by:
    Key R&D Program of Guangxi (Guike AB21220023), National Natural Science Foundation of China (62366011), Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2306), and Innovation Project of GUET Graduate Education (2023YCXB11).


Abstract: Image captioning, a multimodal task that combines computer vision and natural language processing, aims to comprehend the content of images and generate appropriate textual captions. Existing image captioning methods often employ self-attention mechanisms to capture long-range dependencies within a sample. However, this approach overlooks the potential correlations among different samples and fails to exploit prior knowledge, resulting in discrepancies between the generated content and the reference captions. To address these issues, this paper proposes an image captioning method based on external prior and self-prior attention (EPSPA). The external prior module implicitly considers the potential correlations among samples, thereby reducing interference from other samples. Meanwhile, the self-prior attention makes full use of the attention weights from the previous layer to simulate prior knowledge and guide the model during feature extraction. EPSPA is evaluated on publicly available datasets with multiple metrics, and the results demonstrate that it outperforms existing methods while maintaining a low parameter count.
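This page does not give the exact formulation of the two modules, so the following is only a rough sketch of how the ideas described in the abstract could be realized: an external prior implemented as a small learnable memory shared across all samples, in the spirit of external attention [22], and a self-prior that blends the previous layer's attention map into the current one. The class name, the memory size num_mem and the mixing weight alpha are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EPSPAttentionSketch(nn.Module):
        # Illustrative sketch only, not the authors' implementation.
        def __init__(self, dim, num_mem=64, alpha=0.5):
            super().__init__()
            self.mem_k = nn.Linear(dim, num_mem, bias=False)  # shared key memory, learned over the whole dataset
            self.mem_v = nn.Linear(num_mem, dim, bias=False)  # shared value memory
            self.alpha = alpha                                 # prior/current mixing weight (assumed)

        def forward(self, x, prev_attn=None):
            # x: (batch, n_regions, dim) image region or grid features
            attn = F.softmax(self.mem_k(x), dim=1)                 # attend over regions
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-6)  # double normalization, as in external attention
            if prev_attn is not None:
                # self-prior: the previous layer's attention map acts as a prior on this layer
                attn = self.alpha * prev_attn + (1 - self.alpha) * attn
            return self.mem_v(attn), attn                          # refined features and the attention map

    # Usage: stack two layers and feed the first layer's attention map to the second.
    layer1, layer2 = EPSPAttentionSketch(512), EPSPAttentionSketch(512)
    feats = torch.randn(2, 49, 512)       # e.g. two images, 7x7 grid features
    h1, a1 = layer1(feats)                # no prior available at the first layer
    h2, a2 = layer2(h1, prev_attn=a1)     # guided by the previous layer's attention

Because the memory matrices are optimized over the whole training set rather than computed from a single image, they can implicitly carry correlations among samples, which is the role the abstract assigns to the external prior module; returning the attention map lets the next layer reuse it as its prior.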

Key words: Image captioning, Self-attention mechanism, Potential correlations, External prior module, Self-prior attention
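The evaluation metrics are not named in the abstract, but the reference list includes the standard caption metrics BLEU [29], METEOR [30], ROUGE [31], CIDEr [25] and SPICE [32], which are typically computed with the COCO caption evaluation toolkit. Below is a minimal sketch of scoring one generated caption against its references, assuming the pycocoevalcap package and pre-tokenized lower-case captions; the image id and captions are made up, and in practice the scores are computed over the whole test split:

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.cider.cider import Cider

    # Reference captions and the generated caption, keyed by image id.
    gts = {"391895": ["a man riding a bike down a dirt road",
                      "a person on a bicycle on a country path"]}
    res = {"391895": ["a man rides a bicycle on a dirt road"]}

    bleu, _ = Bleu(4).compute_score(gts, res)    # BLEU-1 .. BLEU-4 scores
    cider, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr score
    print(bleu, cider)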

CLC Number: TP391

References
[1]SUTSKEVER I,VINYALS O,LE Q V.Sequence to sequence learning with neural networks[J].arXiv:1409.3215,2014.
[2]VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3156-3164.
[3]KARPATHY A,LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3128-3137.
[4]CORNIA M,BARALDI L,CUCCHIARA R.Show,control and tell:A framework for generating controllable and grounded captions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:8307-8316.
[5]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].arXiv:1706.03762,2017.
[6]LIU Z,LIN Y,CAO Y,et al.Swin transformer:Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:10012-10022.
[7]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[8]HUANG L,WANG W,CHEN J,et al.Attention on attention for image captioning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:4634-4643.
[9]CORNIA M,STEFANINI M,BARALDI L,et al.Meshed-memory transformer for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:10578-10587.
[10]SZEGEDY C,VANHOUCKE V,IOFFE S,et al.Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:2818-2826.
[11]XU K,BA J,KIROS R,et al.Show,attend and tell:Neural image caption generation with visual attention[C]//International Conference on Machine Learning.PMLR,2015:2048-2057.
[12]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[13]LUONG M T,PHAM H,MANNING C D.Effective approaches to attention-based neural machine translation[J].arXiv:1508.04025,2015.
[14]ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6077-6086.
[15]LI Y,PAN Y,YAO T,et al.Comprehending and ordering semantics for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:17990-17999.
[16]JOHNSON J,KRISHNA R,STARK M,et al.Image retrieval using scene graphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3668-3678.
[17]HERDADE S,KAPPELER A,BOAKYE K,et al.Image captioning:Transforming objects into words[J].arXiv:2106.10887,2019.
[18]FANG Z J,ZHANG J,LI D D.Image description algorithm based on spatial and multi-level joint coding[J].Computer Science,2022,49(10):151-158.
[19]GUO L,LIU J,ZHU X,et al.Normalized and geometry-aware self-attention network for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:10327-10336.
[20]TOLSTIKHIN I O,HOULSBY N,KOLESNIKOV A,et al.Mlp-mixer:An all-mlp architecture for vision[J].Advances in Neural Information Processing Systems,2021,34:24261-24272.
[21]LIU H,DAI Z,SO D,et al.Pay attention to mlps[J].Advances in Neural Information Processing Systems,2021,34:9204-9215.
[22]GUO M H,LIU Z N,MU T J,et al.Beyond self-attention:External attention using two linear layers for visual tasks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2022,45(5):5436-5447.
[23]REN S,HE K,GIRSHICK R,et al.Faster r-cnn:Towards real-time object detection with region proposal networks[J].Advances in Neural Information Processing Systems,2015,28:91-99.
[24]NEUMANN P M,PRAEGER C E.Cyclic matrices over finite fields[J].Journal of the London Mathematical Society,1995,52(2):263-284.
[25]VEDANTAM R,ZITNICK C L,PARIKH D.Cider:Consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:4566-4575.
[26]RENNIE S J,MARCHERET E,MROUEH Y,et al.Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:7008-7024.
[27]WANG P,NG H T.A beam-search decoder for normalization of social media text with application to machine translation[C]//Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2013:471-481.
[28]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft coco:Common objects in context[C]//Computer Vision-ECCV 2014:13th European Conference,Zurich,Switzerland,September 6-12,2014,Proceedings,Part V 13.Springer International Publishing,2014:740-755.
[29]PAPINENI K,ROUKOS S,WARD T,et al.Bleu:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.2002:311-318.
[30]BANERJEE S,LAVIE A.METEOR:An automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.2005:65-72.
[31]LIN C Y.Rouge:A package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out.2004:74-81.
[32]ANDERSON P,FERNANDO B,JOHNSON M,et al.Spice:Semantic propositional image caption evaluation[C]//Computer Vision-ECCV 2016:14th European Conference,Amsterdam,The Netherlands,October 11-14,2016,Proceedings,Part V 14.Springer International Publishing,2016:382-398.
[33]KRISHNA R,ZHU Y,GROTH O,et al.Visual genome:Connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision,2017,123(1):32-73.
[34]KINGMA D P,BA J.Adam:A method for stochastic optimization[J].arXiv:1412.6980,2014.
[35]JIANG W,MA L,JIANG Y G,et al.Recurrent fusion network for image captioning[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:499-515.
[36]YAO T,PAN Y,LI Y,et al.Exploring visual relationship for image captioning[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:684-699.
[37]YANG X,TANG K,ZHANG H,et al.Auto-encoding scene graphs for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:10685-10694.
[38]LIU M F,SHI Q,NIE L Q.Image Captioning Based on Visual Relevance and Context Dual Attention[J].Journal of Software,2022,33(9):3210-3222.