Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240400101-8. doi: 10.11896/jsjkx.240400101
XU Yutao, TANG Shouguo
Abstract: To address the difficulty that current Visual Question Answering (VQA) models have with questions requiring external knowledge to answer, this paper proposes a Question-Guided Mechanism for Querying External Knowledge (QGK), which integrates key knowledge to enrich the question text and thereby improve VQA accuracy. First, the QGK mechanism is developed to expand the model's textual feature representation and strengthen its ability to handle complex questions; it comprises a multi-stage pipeline of keyword extraction, query construction, knowledge filtering, and refinement. Second, visual commonsense features are introduced to validate the effectiveness of the proposed method. Experimental results show that the proposed query mechanism effectively supplies important external knowledge and significantly improves accuracy on the VQA v2.0 dataset: adding the query mechanism alone to the baseline model raises accuracy to 71.05%, and combining visual commonsense features with the external knowledge query mechanism raises it further to 71.38%. These results confirm the method's clear benefit to VQA model performance.
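The four-stage pipeline the abstract describes (keyword extraction, query construction, knowledge filtering, refinement) can be illustrated with a minimal sketch. All function names, the toy knowledge base, and the overlap-based scoring heuristic below are illustrative assumptions, not the authors' implementation; a real system would query an external source such as ConceptNet.

```python
# Hypothetical sketch of a question-guided external knowledge query (QGK)
# pipeline. The stopword list, toy knowledge base, and relevance heuristic
# are assumptions for illustration only.

STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "on", "for", "this", "that"}

# Toy stand-in for an external knowledge base (e.g. ConceptNet).
KNOWLEDGE_BASE = {
    "umbrella": [
        "an umbrella protects from rain",
        "umbrellas are carried when it rains",
    ],
    "rain": ["rain is water falling from clouds"],
}

def extract_keywords(question: str) -> list[str]:
    """Stage 1: keep the content words of the question (naive stopword filter)."""
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

def build_queries(keywords: list[str]) -> list[str]:
    """Stage 2: one query per keyword (a real system might combine keywords)."""
    return keywords

def retrieve_and_filter(queries: list[str], top_k: int = 2) -> list[str]:
    """Stages 3-4: retrieve candidate facts, then keep the top_k facts whose
    words overlap the query terms most (a crude relevance filter)."""
    scored = []
    for q in queries:
        for fact in KNOWLEDGE_BASE.get(q, []):
            overlap = len(set(fact.split()) & set(queries))
            scored.append((overlap, fact))
    scored.sort(key=lambda pair: -pair[0])
    return [fact for _, fact in scored[:top_k]]

def enrich_question(question: str) -> str:
    """Append the refined external knowledge to the question text."""
    facts = retrieve_and_filter(build_queries(extract_keywords(question)))
    return question + " " + " ".join(facts) if facts else question

print(enrich_question("What is the umbrella for?"))
```

The enriched string, rather than the raw question, would then be fed to the VQA model's text encoder, which is how the abstract's "expanding the textual feature representation" can be read.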