Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240800086-7. doi: 10.11896/jsjkx.240800086
徐钰涛, 汤守国
XU Yutao, TANG Shouguo
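Abstract: To better exploit the latent commonsense information in images, this paper introduces a novel visual commonsense feature for the Visual Question Answering (VQA) task and uses a visual feature fusion module to integrate bottom-up features with visual commonsense features, yielding a richer visual feature representation. Within this module, a guided-attention fusion method feeds the bottom-up features and the visual commonsense features jointly into an information interaction module, so that the attention mechanism captures the image content most relevant to the question text. On this basis, a Gated Counting Module (GCM) is designed and introduced to preserve the information about the number of entities in the image features. This module significantly improves model performance on counting questions while maintaining the completeness and relevance of the information. Compared with traditional methods, GCM handles quantity-related visual questions more accurately, which improves overall accuracy on the VQA task. Finally, extensive experiments on the widely used VQA v2.0 dataset show that the proposed method achieves good results.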
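The abstract does not give the equations behind the fusion module or the GCM, so the following is only a minimal sketch of the general idea, under stated assumptions: per-region bottom-up and visual commonsense features are concatenated and projected, a question vector guides a softmax attention over regions, and a question-conditioned sigmoid gate controls how much of a separate count feature is added back. All names (`guided_fusion`, `gated_count`, `w_proj`, `w_gate`), the dot-product attention, and the additive gating are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def guided_fusion(bottom_up, commonsense, question, w_proj):
    """Fuse two per-region feature sets and attend to them with the question.

    bottom_up:   (N, d_bu) region features (assumed shape)
    commonsense: (N, d_vc) region features (assumed shape)
    question:    (d,) question embedding
    w_proj:      (d_bu + d_vc, d) hypothetical projection matrix
    """
    # Concatenate the two feature types per region, then project to size d.
    fused = np.concatenate([bottom_up, commonsense], axis=1) @ w_proj
    # Scaled dot-product score of each region against the question.
    scores = fused @ question / np.sqrt(question.shape[0])
    weights = softmax(scores)
    # Attention-weighted sum of the fused region features.
    return weights @ fused, weights


def gated_count(attended, count_feat, question, w_gate):
    """Add a count feature back in, scaled by a question-conditioned gate.

    This additive gate is an assumption made for illustration only.
    """
    gate = sigmoid(w_gate @ question)  # values in (0, 1), shape (d,)
    return attended + gate * count_feat, gate


# Small demo with random inputs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n_regions, d_bu, d_vc, d = 4, 6, 3, 5
attended, weights = guided_fusion(
    rng.normal(size=(n_regions, d_bu)),
    rng.normal(size=(n_regions, d_vc)),
    rng.normal(size=d),
    rng.normal(size=(d_bu + d_vc, d)),
)
output, gate = gated_count(
    attended, rng.normal(size=d), rng.normal(size=d), rng.normal(size=(d, d))
)
print(weights.shape, output.shape)
```

The gate here lets a question decide how strongly the count feature contributes, which is one plausible reading of "gated"; the paper itself may gate differently.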