Computer Science ›› 2025, Vol. 52 ›› Issue (8): 222-231. doi: 10.11896/jsjkx.240600082

• Computer Graphics & Multimedia •

VSRI: Visual Semantic Relational Interactor for Image Caption

LIU Jian, YAO Renyuan, GAO Nan, LIANG Ronghua, CHEN Peng   

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
  • Received: 2024-06-12  Revised: 2024-09-27  Online: 2025-08-15  Published: 2025-08-08
  • About author: LIU Jian, born in 1988, Ph.D., assistant professor, is a member of CCF (No. P8928M). His main research interests include time-series databases, storage systems, and machine learning.
    GAO Nan, born in 1983, Ph.D., assistant professor, is a member of CCF (No. 83932F). Her main research interests include cross-modal generation and retrieval, natural language processing, and medical image processing.
  • Supported by:
    National Natural Science Foundation of China (62202430) and Natural Science Foundation of Zhejiang Province (LY24F020018, LDT23F0202, LDT23F02021F02).

Abstract: Image captioning is a key task in multimodal image understanding. This paper aims to generate detail-rich and accurate image captions. Mainstream image captioning methods focus on the interrelationships between regions but ignore the visual semantic relationships between regions and grids, leading to suboptimal generation results. This paper proposes a visual semantic relational interactor (VSRI) framework, which dynamically constructs visual semantic relational interactions between regions and grids to generate captions with rich scene details and accurate relationships. Specifically, region semantic relations are first constructed by a semantic relation constructor (SRC). Then, a visual-semantic relation joint encoder (VSRJE) module is proposed to construct visual and semantic relational interactions within and between regions and grids. Finally, an adaptive bridging decoder (ABD) module is designed to dynamically balance the contributions of multi-granularity region and grid features during text generation. Experiments on the MSCOCO dataset show that the proposed VSRI significantly outperforms baselines on seven metrics, including BLEU and METEOR.
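The page does not reproduce the paper's architecture, but the adaptive-bridging idea in the abstract (a decoder that dynamically weighs detector region features against CNN/ViT grid features) can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption: the class name AdaptiveBridge, the sigmoid gate, and the dimensions are not the authors' ABD implementation.

```python
# Illustrative sketch only -- NOT the authors' VSRI/ABD implementation.
# Shows one plausible way a caption decoder could dynamically balance
# multi-granularity region and grid features with a learned gate.
import torch
import torch.nn as nn


class AdaptiveBridge(nn.Module):
    """Fuse region- and grid-attended contexts with a per-step gate."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention from the partial caption to each feature granularity.
        self.region_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate predicts, per decoding step, how much to trust each granularity.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, caption_states, region_feats, grid_feats):
        # caption_states: (B, T, d) decoder hidden states
        # region_feats:   (B, Nr, d) detector region features
        # grid_feats:     (B, Ng, d) CNN/ViT grid features
        region_ctx, _ = self.region_attn(caption_states, region_feats, region_feats)
        grid_ctx, _ = self.grid_attn(caption_states, grid_feats, grid_feats)
        g = self.gate(torch.cat([region_ctx, grid_ctx], dim=-1))  # values in [0, 1]
        return g * region_ctx + (1.0 - g) * grid_ctx


if __name__ == "__main__":
    bridge = AdaptiveBridge()
    fused = bridge(torch.randn(2, 12, 512),   # 12 generated tokens so far
                   torch.randn(2, 36, 512),   # 36 detected regions
                   torch.randn(2, 49, 512))   # 7x7 grid features
    print(fused.shape)  # torch.Size([2, 12, 512])
```

Under these assumptions, the gate lets the decoder lean on region features when naming objects and on grid features when describing scene-level context, which matches the abstract's claim of dynamically balancing the two granularities.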

Key words: Image caption, Visual semantic relation, Multimodal learning, Attention mechanism, Neural networks

CLC Number: TP391