Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240800086-7.doi: 10.11896/jsjkx.240800086

• Image Processing & Multimedia Technology •

Visual Question Answering Integrating Visual Common Sense Features and Gated Counting Module

XU Yutao, TANG Shouguo   

  1. Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650504,China
    Yunnan Key Laboratory of Computer Technologies Application,Kunming 650504,China
  • Online:2025-06-16 Published:2025-06-12
  • About author:XU Yutao,born in 1999,postgraduate.His main research interest is visual question answering.
    TANG Shouguo,born in 1981,expert experimenter.His main research interests include medical information technology and machine learning.
  • Supported by:
    Yunnan Fundamental Research Projects(202201AS070029) and Yunnan Science and Technology Major Project(202302AD080002).

Abstract: To better exploit the latent common sense information in images, this paper introduces visual common sense features into the visual question answering (VQA) task and fuses them with bottom-up features through a visual feature fusion module, yielding a richer visual representation. A guided attention fusion method feeds the bottom-up features and the visual common sense features into an information interaction module, so that the attention mechanism can capture the image content most relevant to the question text. On this basis, a gated counting module (GCM) is designed to preserve the number of entities encoded in the image features; it markedly improves performance on counting questions while maintaining the integrity and relevance of the visual information. Compared with traditional methods, the GCM handles quantity-related visual questions more accurately and thereby raises the accuracy of the overall VQA task. Extensive experiments on the widely used VQA v2.0 dataset show that the proposed model achieves competitive results.
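To give a concrete sense of how a gate can combine the two visual streams while keeping each region's identity intact (the property the counting module relies on), the following PyTorch sketch shows one plausible formulation. The class name, feature dimensions, and gating formula are illustrative assumptions and are not taken from the paper's implementation.

import torch
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    # Hypothetical sketch: blend bottom-up region features (e.g. Faster R-CNN)
    # with visual common sense features (e.g. VC R-CNN) through a sigmoid gate,
    # so that per-region information needed for counting is not averaged away.
    def __init__(self, dim_bu=2048, dim_vc=1024, dim_out=1024):
        super().__init__()
        self.proj_bu = nn.Linear(dim_bu, dim_out)
        self.proj_vc = nn.Linear(dim_vc, dim_out)
        self.gate = nn.Linear(2 * dim_out, dim_out)

    def forward(self, feat_bu, feat_vc):
        # feat_bu: [batch, regions, dim_bu]; feat_vc: [batch, regions, dim_vc]
        h_bu = torch.relu(self.proj_bu(feat_bu))
        h_vc = torch.relu(self.proj_vc(feat_vc))
        g = torch.sigmoid(self.gate(torch.cat([h_bu, h_vc], dim=-1)))
        return g * h_bu + (1.0 - g) * h_vc  # per-region gated blend

fusion = GatedVisualFusion()
bu = torch.randn(2, 36, 2048)   # 36 detected regions per image (a common choice)
vc = torch.randn(2, 36, 1024)
print(fusion(bu, vc).shape)     # torch.Size([2, 36, 1024])

Because the gate operates per region rather than pooling over regions, a downstream counting head can still distinguish individual detected entities.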

Key words: Visual question answering, Visual common sense feature, Feature fusion, Visual feature, Faster R-CNN, Gated counting module

CLC Number: TP391

References:
[1]ANTOL S,AGRAWAL A,LU J,et al.VQA:Visual Question Answering[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2425-2433.
[2]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet Classification with Deep Convolutional Neural Networks[J].Communications of the ACM,2017,60(6):84-90.
[3]REN S,HE K,GIRSHICK R,et al.Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2016,39(6):1137-1149.
[4]HOCHREITER S,SCHMIDHUBER J.Long Short-Term Memory[J].Neural Computation,1997,9(8):1735-1780.
[5]VASWANI A,SHAZEER N,PARMAR N,et al.Attention Is All You Need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[6]WANG Z,JI S.Learning Convolutional Text Representations for Visual Question Answering[C]//Proceedings of the 2018 SIAM International Conference on Data Mining.2018:594-602.
[7]LU J,YANG J,BATRA D,et al.Hierarchical Question-Image Co-Attention for Visual Question Answering[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.2016:289-297.
[8]YU Z,YU J,CUI Y,et al.Deep Modular Co-Attention Networks for Visual Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:6281-6290.
[9]HE K,ZHANG X,REN S,et al.Deep Residual Learning for Image Recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[10]HE K,GKIOXARI G,DOLLAR P,et al.Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:2961-2969.
[11]LONG J,SHELHAMER E,DARRELL T.Fully Convolutional Networks for Semantic Segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3431-3440.
[12]LIU W,ANGUELOV D,ERHAN D,et al.SSD:Single Shot MultiBox Detector[C]//Computer Vision-ECCV 2016.Cham:Springer International Publishing,2016:21-37.
[13]LI B,WU W,WANG Q,et al.SiamRPN++:Evolution of Siamese Visual Tracking With Very Deep Networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:4282-4291.
[14]KRISTAN M,MATAS J,LEONARDIS A,et al.The Visual Object Tracking VOT2015 Challenge Results[C]//Proceedings of the IEEE International Conference on Computer Vision Workshops.2015:1-23.
[15]WANG T,HUANG J,ZHANG H,et al.Visual Commonsense R-CNN[C/OL]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:10760-10770.
[16]AGRAWAL A,BATRA D,PARIKH D,et al.Don't Just Assume; Look and Answer:Overcoming Priors for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:4971-4980.
[17]HUDSON D A,MANNING C D.GQA:A New Dataset for Real-World Visual Reasoning and Compositional Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:6700-6709.
[18]HENDRICKS L A,BURNS K,SAENKO K,et al.Women also Snowboard:Overcoming Bias in Captioning Models[C]//Proceedings of the European Conference on Computer Vision (ECCV).2018:771-787.
[19]MANJUNATHA V,SAINI N,DAVIS L S.Explicit Bias Discovery in Visual Question Answering Models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:9562-9571.
[20]RAMAKRISHNAN S,AGRAWAL A,LEE S.Overcoming Language Priors in Visual Question Answering with Adversarial Regularization[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems.2018:1548-1558.
[21]SADEGHI F,KUMAR DIVVALA S K,FARHADI A.VisKE:Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:1456-1464.
[22]SU Z,ZHU C,DONG Y,et al.Learning Visual Knowledge Memory Networks for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7736-7745.
[23]GOYAL R,EBRAHIMI KAHOU S,MICHALSKI V,et al.The “Something Something” Video Database for Learning and Evaluating Visual Common Sense[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:5842-5850.
[24]LEMPITSKY V,ZISSERMAN A.Learning To Count Objects in Images[C]//Proceedings of the 24th International Conference on Neural Information Processing Systems.2010:1324-1332.
[25]XIONG H,LU H,LIU C,et al.From Open Set to Closed Set:Counting Objects by Spatial Divide-and-Conquer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:8362-8371.
[26]HUBERMAN-SPIEGELGLAS I,FATTAL R.Single Image Object Counting and Localizing Using Active-Learning[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.2022:1310-1319.
[27]ZHANG Y,HARE J,PRÜGEL-BENNETT A.Learning to Count Objects in Natural Images for Visual Question Answering[J].arXiv:1802.05766,2018.
[28]ACHARYA M,KAFLE K,KANAN C.TallyQA:Answering Complex Counting Questions[J].Proceedings of the AAAI Conference on Artificial Intelligence,2019,33(1):8076-8084.
[29]TROTT A,XIONG C,SOCHER R.Interpretable Counting for Visual Question Answering[J].arXiv:1712.08697,2018.
[30]WHITEHEAD S,WU H,JI H,et al.Separating Skills and Concepts for Novel Visual Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:5632-5641.
[31]ANDERSON P,HE X,BUEHLER C,et al.Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6077-6086.
[32]PENNINGTON J,SOCHER R,MANNING C.GloVe:Global Vectors for Word Representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP).Doha,Qatar:Association for Computational Linguistics,2014:1532-1543.
[33]KRISHNA R,ZHU Y,GROTH O,et al.Visual Genome:Connecting Language and Vision Using Crowdsourced Dense Image Annotations[J].International Journal of Computer Vision,2017,123(1):32-73.
[34]KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[J].arXiv:1412.6980,2017.
[35]YU Z,YU J,XIANG C,et al.Beyond Bilinear:Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering[J].IEEE Transactions on Neural Networks and Learning Systems,2018,29(12):5947-5959.
[36]BEN-YOUNES H,CADENE R,THOME N,et al.BLOCK:Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:8102-8109.
[37]NGUYEN D K,OKATANI T.Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6087-6096.
[38]KIM J H,JUN J,ZHANG B T.Bilinear Attention Networks[C]//NeurIPS 2018.Montréal,Canada,2018:1-11.
[39]GAO P,JIANG Z,YOU H,et al.Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:6639-6648.
[40]KIM J J,LEE D G,WU J,et al.Visual question answering based on local-scene-aware referring expression generation[J].Neural Networks,2021,139:158-167.
[41]SHUANG K,GUO J,WANG Z.Comprehensive-perception dynamic reasoning for visual question answering[J].Pattern Recognition,2022,131:108878.
[42]CHEN C,HAN D,CHANG C C.CAAN:Context-Aware attention network for visual question answering[J].Pattern Recognition,2022,132:108980.
[43]GUO Z,HAN D.Sparse co-attention visual question answering networks based on thresholds[J].Applied Intelligence,2023,53(1):586-600.
[44]HU T,HE L L.Joint relational reasoning visual question answering model based on gating mechanism[J].Intelligent Computer and Applications,2023,13(12):138-143.
[45]ZHANG J,LIU X,WANG Z.Latent Attention Network WithPosition Perception for Visual Question Answering[J].IEEE Transactions on Neural Networks and Learning Systems,2024,36(3):5059-5069.