Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240800086-7. doi: 10.11896/jsjkx.240800086

• Image Processing & Multimedia Technology •

  • Corresponding author: TANG Shouguo (tondycool@qq.com)
  • First author's e-mail: 20212104076@stu.kust.edu.cn

Visual Question Answering Integrating Visual Common Sense Features and Gated Counting Module

XU Yutao, TANG Shouguo   

  1. Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650504,China
    Yunnan Key Laboratory of Computer Technologies Application,Kunming 650504,China
  • Online:2025-06-16 Published:2025-06-12
  • About author:XU Yutao,born in 1999,postgraduate.His main research interest is visual question answering.
    TANG Shouguo,born in 1981,expert experimenter.His main research interests include medical information technology and machine learning.
  • Supported by:
    Yunnan Fundamental Research Projects(202201AS070029) and Yunnan Science and Technology Major Project(202302AD080002).


Abstract: To better exploit latent common sense information in images, this paper introduces an innovative visual common sense feature for the visual question answering (VQA) task and effectively integrates bottom-up features with visual common sense features through a visual feature fusion module, yielding a rich visual feature representation. A guided attention fusion method feeds the bottom-up features and the visual common sense features jointly into an information interaction module, enabling the attention mechanism to capture the image content most relevant to the question text. On this basis, this paper also designs a gated counting module (GCM) that retains information about the number of entities in the image features. The module significantly improves model performance on counting questions while maintaining the integrity and relevance of the information. Compared with traditional methods, GCM handles quantity-related visual questions more accurately, enhancing the accuracy of the overall VQA task. Finally, extensive experiments on the widely used VQA v2.0 dataset show that the proposed method achieves good results.
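The guided attention fusion described above can be illustrated with a minimal pure-Python sketch. This is not the authors' architecture (the paper's exact design is not reproduced in the abstract): the element-wise sum of the two visual streams, the dot-product scoring against the question vector, and every name below are illustrative assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def guided_attention_fusion(bottom_up, commonsense, question):
    # Merge the two visual streams region by region (assumed: element-wise sum).
    regions = [[b + c for b, c in zip(br, cr)]
               for br, cr in zip(bottom_up, commonsense)]
    # Score each merged region by its similarity to the question vector.
    weights = softmax([sum(r_j * q_j for r_j, q_j in zip(r, question))
                       for r in regions])
    # Attention-weighted pooling yields one question-aware visual vector.
    return [sum(w * r[j] for w, r in zip(weights, regions))
            for j in range(len(question))]

bottom_up   = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 regions, d = 2
commonsense = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # hypothetical commonsense stream
question    = [1.0, 0.0]                             # question embedding
pooled = guided_attention_fusion(bottom_up, commonsense, question)
print(len(pooled))  # 2
```

In this toy example the attention weights concentrate on the regions whose merged features align with the question vector, so the pooled vector is dominated by question-relevant content, which is the behavior the fusion module is meant to produce.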

Key words: Visual question answering, Visual common sense feature, Feature fusion, Visual feature, Faster R-CNN, Gated counting module
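The gated counting idea from the abstract can likewise be sketched. The GCM's actual design is not specified here; this sketch only illustrates the gating principle: softmax-normalized attention pooling washes out how many objects matched, so a sigmoid gate blends an unnormalized, count-sensitive sum of region scores back into the fused feature. The parameter names `w_gate` and `b_gate` stand in for learned weights and are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_counting(fused, region_scores, w_gate, b_gate):
    # Attention pooling normalizes region weights to sum to one, so the
    # result barely changes with object count; the raw sum of region
    # scores, by contrast, grows with the number of detected entities.
    count_signal = sum(region_scores)
    gated = []
    for f, w, b in zip(fused, w_gate, b_gate):
        g = sigmoid(w * f + b)               # element-wise gate in (0, 1)
        # Mix count-sensitive information into the fused feature.
        gated.append(g * count_signal + (1.0 - g) * f)
    return gated

fused = [0.2, -0.4]                          # toy fused visual feature
out = gated_counting(fused,
                     region_scores=[0.9, 0.8, 0.7],  # per-region detection scores
                     w_gate=[1.0, 1.0], b_gate=[0.0, 0.0])
print(len(out))  # 2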

CLC number: TP391