Computer Science ›› 2024, Vol. 51 ›› Issue (7): 214-220. doi: 10.11896/jsjkx.230600167

• Computer Graphics & Multimedia •

Image Captioning Generation Method Based on External Prior and Self-prior Attention

LI Yongjie, QIAN Yi, WEN Yimin   

  1. Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China
  • Received: 2023-06-21  Revised: 2023-10-25  Online: 2024-07-15  Published: 2024-07-10
  • About author: LI Yongjie, born in 1998, postgraduate. His main research interests include computer vision, neural networks and image captioning.
    WEN Yimin, born in 1969, Ph.D, professor, Ph.D supervisor, is a distinguished member of CCF (No. 06757D). His main research interests include machine learning, computer vision and big data analytics.
  • Supported by:
    Key R&D Program of Guangxi (AB21220023), National Natural Science Foundation of China (62366011), Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2306) and Innovation Project of GUET Graduate Education (2023YCXB11).

Abstract: Image captioning, a multimodal task that combines computer vision and natural language processing, aims to comprehend the content of an image and generate an appropriate textual caption. Existing image captioning methods often employ self-attention mechanisms to capture long-range dependencies within a sample. However, this approach overlooks the potential correlations among different samples and fails to utilize prior knowledge, resulting in discrepancies between the generated content and the reference captions. To address these issues, this paper proposes an image captioning approach based on external prior and self-prior attention (EPSPA). The external prior module implicitly models the potential correlations among samples while removing interference from other samples. Meanwhile, the self-prior attention reuses attention weights from previous layers as simulated prior knowledge to guide the model during feature extraction. Evaluation of EPSPA on publicly available datasets under multiple metrics demonstrates that it outperforms existing methods while maintaining a low parameter count.
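The abstract is the only technical description on this page, so the following is a minimal, hypothetical PyTorch sketch of the two ideas it names: an external-prior block whose queries attend to a small set of learnable memory units shared across all samples (so cross-sample correlations are captured implicitly), and a self-prior attention layer that blends the current attention map with the map produced by the previous layer, treating it as prior knowledge. Every module name, dimension, and the mixing coefficient alpha below are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the two components described in the abstract.
# Not the authors' code: names, sizes, and the mixing weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalPriorModule(nn.Module):
    """External-attention-style block: features attend to a small set of
    learnable memory units shared across the whole dataset, so correlations
    among different samples are modelled implicitly."""
    def __init__(self, d_model=512, n_units=64):
        super().__init__()
        self.mk = nn.Linear(d_model, n_units, bias=False)  # memory keys
        self.mv = nn.Linear(n_units, d_model, bias=False)  # memory values

    def forward(self, x):                      # x: (batch, n_regions, d_model)
        attn = F.softmax(self.mk(x), dim=-1)   # attention over external units
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-6)  # double normalization
        return self.mv(attn)

class SelfPriorAttention(nn.Module):
    """Self-attention whose weights are blended with the previous layer's
    attention map, using it as a prior to guide feature extraction."""
    def __init__(self, d_model=512, alpha=0.5):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.alpha = alpha                     # weight on the current layer's attention

    def forward(self, x, prior_attn=None):     # prior_attn: (batch, n, n) or None
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        if prior_attn is not None:             # mix in the previous layer's weights
            attn = self.alpha * attn + (1 - self.alpha) * prior_attn
        return attn @ self.v(x), attn          # return weights as prior for the next layer

In a stacked encoder built from such layers, the attention map returned by one layer would be passed as prior_attn to the next, which is the sense in which earlier attention weights can act as a prior for later feature extraction.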

Key words: Image captioning, Self-attention mechanism, Potential associations, External prior module, Self-prior attention

CLC Number: TP391