Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240400101-8. doi: 10.11896/jsjkx.240400101

• Artificial Intelligence •

External Knowledge Query-based Visual Question Answering

XU Yutao, TANG Shouguo   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
    Yunnan Key Laboratory of Computer Technologies Application, Kunming 650504, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: TANG Shouguo (tondycool@qq.com)
  • About author: XU Yutao (20212104076@stu.kust.edu.cn)
  • Supported by:
    Special Foundation for Basic Research Program of Yunnan (202201AS070029) and Major Project of Yunnan (202302AD080002).

External Knowledge Query-based Visual Question Answering

XU Yutao, TANG Shouguo   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
    Yunnan Key Laboratory of Computer Technologies Application, Kunming 650504, China
  • Online:2025-06-16 Published:2025-06-12
  • About author: XU Yutao, born in 1999, postgraduate. His main research interests include visual question answering.
    TANG Shouguo, born in 1981, expert experimenter. His main research interests include medical information technology and machine learning.
  • Supported by:
    Special Foundation for Basic Research Program of Yunnan (202201AS070029) and Major Project of Yunnan (202302AD080002).

Abstract: To address the difficulty that current visual question answering (VQA) models have with questions that require additional knowledge, this paper proposes a question-guided mechanism for querying external knowledge (QGK), which integrates key knowledge to enrich the question text and thereby improves the accuracy of VQA models. First, the QGK mechanism expands the textual feature representation within the model and strengthens its ability to handle complex questions; it follows a multi-stage pipeline of keyword extraction, query construction, knowledge screening, and refinement. Second, visual commonsense features are introduced to verify the effectiveness of the proposed method. Experimental results show that the proposed query mechanism effectively supplies important external knowledge and significantly improves accuracy on the VQA v2.0 dataset: adding the query mechanism alone to the baseline model raises accuracy to 71.05%, and combining visual commonsense features with the external knowledge query mechanism further raises accuracy to 71.38%. These results confirm the significant effect of the proposed method on improving VQA model performance.

Key words: Visual question answering, External knowledge base, Query mechanism, Long short-term memory network, Text feature

Abstract: To address the limitation of current visual question answering (VQA) models in handling questions that require external knowledge, this paper proposes a question-guided mechanism for querying external knowledge (QGK). The aim is to integrate key knowledge to enrich the question text, thereby improving the accuracy of VQA models. We develop a question-guided external knowledge query mechanism to expand the text feature representation within the model and enhance its ability to handle complex questions. This mechanism is a multi-stage pipeline with steps for keyword extraction, query construction, and knowledge screening and refining. In addition, we introduce visual commonsense features to validate the effectiveness of the proposed method. Experimental results demonstrate that the proposed query mechanism effectively provides crucial external knowledge and significantly improves model accuracy on the VQA v2.0 dataset. When the query mechanism alone is integrated into the baseline model, accuracy increases to 71.05%; combining visual commonsense features with the external knowledge query mechanism further boosts accuracy to 71.38%. These results confirm the significant impact of the proposed method on enhancing VQA model performance.
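The multi-stage pipeline described in the abstract (keyword extraction, query construction, knowledge screening and refining) can be sketched in a few lines of Python. Everything below is illustrative: the stopword list, the toy in-memory `KNOWLEDGE_BASE` standing in for a real external knowledge base such as ConceptNet, the overlap-based scoring rule, and all function names are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a question-guided external knowledge query pipeline:
# keyword extraction -> query construction -> knowledge screening and refining.

STOPWORDS = {"what", "is", "the", "a", "an", "in", "of", "for", "this", "that"}

# Toy in-memory stand-in for an external knowledge base such as ConceptNet.
KNOWLEDGE_BASE = {
    "umbrella": ["an umbrella is used for protection from rain"],
    "rain": ["rain is water falling from clouds"],
}

def extract_keywords(question):
    """Keep the content words of the question; drop stopwords and punctuation."""
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

def construct_queries(keywords):
    """One query per keyword; a real system could also combine keyword pairs."""
    return keywords

def screen_and_refine(facts, keywords, top_k=2):
    """Score each retrieved fact by keyword overlap and keep the best few."""
    ranked = sorted(facts, key=lambda f: -sum(k in f for k in keywords))
    return ranked[:top_k]

def enrich_question(question):
    """Append refined external knowledge to the question text before encoding."""
    keywords = extract_keywords(question)
    facts = []
    for query in construct_queries(keywords):
        facts.extend(KNOWLEDGE_BASE.get(query, []))
    kept = screen_and_refine(facts, keywords)
    if not kept:
        return question
    return question + " [KNOWLEDGE] " + " ; ".join(kept)

print(enrich_question("What is the umbrella used for in the rain?"))
```

In the paper's setting, the enriched question text would then be encoded (e.g., by an LSTM, per the keywords) in place of the raw question; the sketch only shows the text-side enrichment step.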

Key words: Visual question answering, External knowledge base, Query mechanism, Long short-term memory network, Text feature

CLC number: TP391