Computer Science ›› 2024, Vol. 51 ›› Issue (7): 214-220. doi: 10.11896/jsjkx.230600167

• Computer Graphics & Multimedia •

Image Captioning Generation Method Based on External Prior and Self-prior Attention

LI Yongjie, QIAN Yi, WEN Yimin   

  1. Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China
  • Received: 2023-06-21  Revised: 2023-10-25  Online: 2024-07-15  Published: 2024-07-10
  • About author: LI Yongjie, born in 1998, postgraduate. His main research interests include computer vision, neural networks and image captioning.
    WEN Yimin, born in 1969, Ph.D, professor, Ph.D supervisor, is a distinguished member of CCF (No. 06757D). His main research interests include machine learning, computer vision and big data analytics.
  • Supported by:
    Key R&D Program of Guangxi (AB21220023), National Natural Science Foundation of China (62366011), Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2306) and Innovation Project of GUET Graduate Education (2023YCXB11).

Abstract: Image captioning, a multimodal task that combines computer vision and natural language processing, aims to comprehend the content of an image and generate an appropriate textual caption. Existing image captioning methods often employ self-attention mechanisms to capture long-range dependencies within a sample. However, this approach overlooks the potential correlations among different samples and fails to utilize prior knowledge, resulting in discrepancies between the generated content and the reference captions. To address these issues, this paper proposes an image captioning approach based on external prior and self-prior attention (EPSPA). The external prior module implicitly models the potential correlations among samples while removing interference from other samples. Meanwhile, the self-prior attention reuses attention weights from previous layers as simulated prior knowledge to guide the model during feature extraction. Evaluation of EPSPA on publicly available datasets under multiple metrics demonstrates that it outperforms existing methods while maintaining a low parameter count.
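The abstract is the only technical description on this page, so the following is a minimal, hypothetical PyTorch sketch of the two ideas it names: an external-prior block whose queries attend to a small set of learnable memory units shared across all samples (so cross-sample correlations are captured implicitly), and a self-prior attention layer that blends the current attention map with the map produced by the previous layer, treating it as prior knowledge. Every module name, dimension, and the mixing coefficient alpha below are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the two components described in the abstract.
# Not the authors' code: names, sizes, and the mixing weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalPriorModule(nn.Module):
    """External-attention-style block: features attend to a small set of
    learnable memory units shared across the whole dataset, so correlations
    among different samples are modelled implicitly."""
    def __init__(self, d_model=512, n_units=64):
        super().__init__()
        self.mk = nn.Linear(d_model, n_units, bias=False)  # memory keys
        self.mv = nn.Linear(n_units, d_model, bias=False)  # memory values

    def forward(self, x):                      # x: (batch, n_regions, d_model)
        attn = F.softmax(self.mk(x), dim=-1)   # attention over external units
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-6)  # double normalization
        return self.mv(attn)

class SelfPriorAttention(nn.Module):
    """Self-attention whose weights are blended with the previous layer's
    attention map, using it as a prior to guide feature extraction."""
    def __init__(self, d_model=512, alpha=0.5):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.alpha = alpha                     # weight on the current layer's attention

    def forward(self, x, prior_attn=None):     # prior_attn: (batch, n, n) or None
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        if prior_attn is not None:             # mix in the previous layer's weights
            attn = self.alpha * attn + (1 - self.alpha) * prior_attn
        return attn @ self.v(x), attn          # return weights as prior for the next layer

In a stacked encoder built from such layers, the attention map returned by one layer would be passed as prior_attn to the next, which is the sense in which earlier attention weights can act as a prior for later feature extraction.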

Key words: Image captioning, Self-attention mechanism, Potential associations, External prior module, Self-prior attention

CLC Number: TP391