Computer Science ›› 2025, Vol. 52 ›› Issue (8): 222-231. doi: 10.11896/jsjkx.240600082
LIU Jian, YAO Renyuan, GAO Nan, LIANG Ronghua, CHEN Peng
Abstract: Image captioning is one of the key goals of multimodal image understanding and requires generating captions that are both rich in detail and accurate. Current mainstream image captioning methods focus mainly on the relationships among regions while ignoring the visual-semantic relationships between regions and grids, which leads to suboptimal generation. To address this, a visual-semantic relation interaction framework is proposed that dynamically builds visual-semantic relation interactions between regions and grids to generate descriptions with rich scene details and accurate relations. First, a semantic relation constructor is proposed to build semantic relations among regions. Then, a visual-semantic relation joint encoder is proposed to model the interaction of visual and semantic relations within and across regions and grids. Finally, an adaptive bridging decoder is proposed to adaptively balance the contributions of region and grid features and to fuse the two kinds of features for text generation. Experiments on the MSCOCO dataset show that the proposed method outperforms mainstream baselines on metrics such as BLEU and METEOR.
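To make the decoder's balancing step concrete, the following is a minimal PyTorch sketch of a gated fusion of region and grid context vectors, in the spirit of the adaptive bridging decoder described above. It is an illustration under stated assumptions, not the authors' implementation: the names AdaptiveBridge, region_ctx, and grid_ctx are hypothetical, and the sigmoid gate is simply the most direct realization of "adaptively balancing the contributions of region and grid features".

```python
# Hypothetical sketch of adaptive region/grid fusion -- module and tensor
# names are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn

class AdaptiveBridge(nn.Module):
    """Predicts a per-position weight and mixes region and grid contexts."""
    def __init__(self, d_model: int):
        super().__init__()
        # The gate maps the concatenated contexts to a mixing weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),
        )

    def forward(self, region_ctx: torch.Tensor, grid_ctx: torch.Tensor) -> torch.Tensor:
        # region_ctx, grid_ctx: (batch, seq_len, d_model) decoder-side context
        # vectors attended from region and grid features, respectively.
        alpha = self.gate(torch.cat([region_ctx, grid_ctx], dim=-1))  # (batch, seq_len, 1)
        # Convex combination: alpha favors regions, (1 - alpha) favors grids.
        return alpha * region_ctx + (1.0 - alpha) * grid_ctx

# Usage: fuse the two context streams at each decoding step.
bridge = AdaptiveBridge(d_model=512)
region_ctx = torch.randn(2, 20, 512)
grid_ctx = torch.randn(2, 20, 512)
fused = bridge(region_ctx, grid_ctx)  # (2, 20, 512), fed to the caption generator
```

A learned convex combination like this lets the decoder lean on region features when object-level detail matters and on grid features when broader scene context matters, which matches the balancing behavior the abstract attributes to the adaptive bridging decoder.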