Computer Science ›› 2025, Vol. 52 ›› Issue (8): 222-231. DOI: 10.11896/jsjkx.240600082

• Computer Graphics & Multimedia •

VSRI: Visual Semantic Relational Interactor for Image Caption

LIU Jian, YAO Renyuan, GAO Nan, LIANG Ronghua, CHEN Peng   

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
  • Received: 2024-06-12  Revised: 2024-09-27  Online: 2025-08-15  Published: 2025-08-08
  • Corresponding author: GAO Nan (gaonan@zjut.edu.cn)
  • About author: LIU Jian (jliu83@zjut.edu.cn), born in 1988, Ph.D., assistant professor, is a member of CCF (No. P8928M). His main research interests include time-series databases, storage systems, and machine learning.
    GAO Nan, born in 1983, Ph.D., assistant professor, is a member of CCF (No. 83932F). Her main research interests include cross-modal generation and retrieval, natural language processing, and medical image processing.
  • Supported by:
    National Natural Science Foundation of China (62202430) and Natural Science Foundation of Zhejiang Province (LY24F020018, LDT23F0202, LDT23F02021F02).

Abstract: Image captioning is one of the key objectives of multimodal image understanding. This paper aims to generate detail-rich and accurate image captions. Currently, mainstream image captioning methods focus on the interrelationships between regions but ignore the visual-semantic relationships between regions and grids, leading to suboptimal generation results. This paper proposes a visual semantic relational interactor (VSRI) framework, which dynamically constructs visual-semantic relational interactions between regions and grids to generate captions with rich scene details and accurate relationships. Specifically, region semantic relations are first constructed by a semantic relation constructor (SRC). Then, a visual-semantic relation joint encoder (VSRJE) module is proposed to construct visual and semantic relational interactions within and between regions and grids. Finally, an adaptive bridging decoder (ABD) module is designed to dynamically balance the contributions of multi-granularity region and grid features and to generate text. Experiments on the MSCOCO dataset show that the proposed VSRI significantly outperforms baselines on seven different metrics, including BLEU and METEOR.
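
To make the adaptive balancing idea concrete, the following is a minimal Python/PyTorch sketch of a learned gate that weighs region-level features against grid-level features before they are handed to a language decoder, in the spirit of the adaptive bridging decoder described above. It is an illustration under stated assumptions, not the authors' implementation: the module name AdaptiveBridge, the dimension d_model, and the mean-pooling summaries are all hypothetical choices.

# Hypothetical sketch only: module and parameter names are illustrative
# assumptions and are not taken from the paper.
import torch
import torch.nn as nn


class AdaptiveBridge(nn.Module):
    """Fuses region-level and grid-level visual features with a learned gate."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Project both feature streams into a shared space.
        self.region_proj = nn.Linear(d_model, d_model)
        self.grid_proj = nn.Linear(d_model, d_model)
        # The gate sees a summary of both streams and outputs a mixing weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),
        )

    def forward(self, region_feats: torch.Tensor, grid_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, n_regions, d_model), e.g. object-detector outputs
        # grid_feats:   (batch, n_grids,   d_model), e.g. CNN/Transformer grid cells
        region_ctx = self.region_proj(region_feats).mean(dim=1)  # (batch, d_model)
        grid_ctx = self.grid_proj(grid_feats).mean(dim=1)        # (batch, d_model)
        alpha = self.gate(torch.cat([region_ctx, grid_ctx], dim=-1))  # (batch, 1)
        # Convex combination: alpha weights regions, (1 - alpha) weights grids.
        return alpha * region_ctx + (1.0 - alpha) * grid_ctx


if __name__ == "__main__":
    bridge = AdaptiveBridge(d_model=512)
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected object regions per image
    grids = torch.randn(2, 49, 512)    # e.g. a 7x7 feature grid per image
    print(bridge(regions, grids).shape)  # torch.Size([2, 512])

A per-token variant of the same gate (recomputed at every decoding step from the decoder's hidden state) would let the balance shift between object-centric and scene-level evidence as the caption is generated; the sketch above shows only the simplest, image-level form.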

Key words: Image caption, Visual semantic relation, Multimodal learning, Attention mechanism, Neural networks

CLC number: TP391