Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240400101-8.doi: 10.11896/jsjkx.240400101

• Artificial Intelligence •

External Knowledge Query-based Visual Question Answering

XU Yutao, TANG Shouguo   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
     Yunnan Key Laboratory of Computer Technologies Application, Kunming 650504, China
  • Online: 2025-06-16  Published: 2025-06-12
  • About author: XU Yutao, born in 1999, postgraduate. His main research interest includes visual question answering.
    TANG Shouguo, born in 1981, expert experimenter. His main research interests include medical information technology and machine learning.
  • Supported by:
    Special Foundation for Basic Research Program of Yunnan (202201AS070029) and Major Project of Yunnan (202302AD080002).

Abstract: To address the limitation of current visual question answering (VQA) models in handling questions that require external knowledge, this paper proposes a question-guided mechanism for querying external knowledge (QGK). The aim is to integrate key knowledge to enrich the question text, thereby improving the accuracy of VQA models. We develop a question-guided external knowledge query mechanism to expand the text feature representation within the model and enhance its ability to handle complex questions. This mechanism is a multi-stage pipeline with steps for keyword extraction, query construction, and knowledge screening and refining. In addition, we introduce visual common sense features to validate the effectiveness of the proposed method. Experimental results demonstrate that the proposed query mechanism effectively provides crucial external knowledge and significantly improves model accuracy on the VQA v2.0 dataset. When the query mechanism is integrated into the baseline model, accuracy increases to 71.05%. Furthermore, combining visual common sense features with the external knowledge query mechanism boosts the model's accuracy to 71.38%. These results confirm the significant impact of the proposed method on enhancing VQA model performance.
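The multi-stage pipeline named in the abstract (keyword extraction, query construction, knowledge screening and refining, then question enrichment) can be sketched as follows. This is a minimal illustrative sketch only: the stopword list, the toy knowledge base `kb`, and all function names are assumptions, not the paper's actual implementation, which queries a real external knowledge base.

```python
# Hypothetical sketch of the question-guided external knowledge query (QGK)
# pipeline: extract keywords, build queries, screen/refine retrieved facts,
# and append the kept facts to the question text fed to the VQA model.

STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "on", "for",
             "this", "that", "doing"}

def extract_keywords(question: str) -> list[str]:
    """Step 1: keep content words from the question as query keywords."""
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOPWORDS]

def build_queries(keywords: list[str]) -> list[str]:
    """Step 2: construct queries (here, one single-term query per keyword)."""
    return keywords

def screen_and_refine(queries, knowledge_base, max_facts=2):
    """Step 3: retrieve candidate facts, then keep only the most relevant
    ones, scored here by simple keyword overlap with the query set."""
    candidates = []
    for q in queries:
        candidates.extend(knowledge_base.get(q, []))
    scored = sorted(
        set(candidates),
        key=lambda fact: sum(kw in fact for kw in queries),
        reverse=True,
    )
    return scored[:max_facts]

def enrich_question(question: str, knowledge_base) -> str:
    """Append the screened knowledge to the question text; the enriched
    text then expands the model's text feature representation."""
    facts = screen_and_refine(build_queries(extract_keywords(question)),
                              knowledge_base)
    return question + " " + " ".join(facts) if facts else question

# Toy external knowledge base (illustrative only).
kb = {
    "umbrella": ["an umbrella shields a person from rain or sun"],
    "rain": ["rain is falling water from clouds"],
}
print(enrich_question("What is the umbrella for?", kb))
```

In the paper the screened facts enrich the question before text encoding; this sketch swaps the external knowledge base for an in-memory dictionary so the control flow of the three stages is visible end to end.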

Key words: Visual question answering, External knowledge base, Query mechanism, Long short-term memory network, Text feature

CLC Number: TP391
[1]ANTOL S,AGRAWAL A,LU J,et al.VQA:Visual Question Answering[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2425-2433.
[2]MALINOWSKI M,ROHRBACH M,FRITZ M.Ask Your Neurons:A Neural-Based Approach to Answering Questions About Images[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:1-9.
[3]KIM J H,ON K W,LIM W,et al.Hadamard Product for Low-rank Bilinear Pooling[M/OL].arXiv,2017[2024-03-31].http://arxiv.org/abs/1610.04325.
[4]LU J,YANG J,BATRA D,et al.Hierarchical Question-Image Co-Attention for Visual Question Answering[C]//Advances in Neural Information Processing Systems:Vol.29.Curran Associates,Inc.,2016.
[5]YU Z,YU J,CUI Y,et al.Deep Modular Co-Attention Net-works for Visual Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:6281-6290.
[6]VASWANI A,SHAZEER N,PARMAR N,et al.Attention Is All You Need[M/OL].arXiv,2017[2022-07-04].http://arxiv.org/abs/1706.03762.
[7]NOH H,SEO P H,HAN B.Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:30-38.
[8]ADITYA S,YANG Y,BARAL C.Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering[J].Proceedings of the AAAI Conference on Artificial Intelligence,2018,32(1).
[9]AUER S,BIZER C,KOBILAROV G,et al.DBpedia:A Nucleus for a Web of Open Data[C]//ABERER K,CHOI K S,NOY N,et al.The Semantic Web.Berlin,Heidelberg:Springer,2007:722-735.
[10]WANG P,WU Q,SHEN C,et al.FVQA:Fact-Based Visual Question Answering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,40(10):2413-2427.
[11]WU Q,WANG P,SHEN C,et al.Ask Me Anything:Free-Form Visual Question Answering Based on Knowledge From External Sources[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:4622-4630.
[12]SPEER R,CHIN J,HAVASI C.ConceptNet 5.5:An Open Multilingual Graph of General Knowledge[J].Proceedings of the AAAI Conference on Artificial Intelligence,2017,31(1).
[13]VRANDEČIĆ D,KRÖTZSCH M.Wikidata:a free collaborative knowledgebase[J].Communications of the ACM,2014,57(10):78-85.
[14]SUCHANEK F M,KASNECI G,WEIKUM G.Yago:a core of semantic knowledge[C]//Proceedings of the 16th international conference on World Wide Web.New York,NY,USA:Association for Computing Machinery,2007:697-706.
[15]WANG T,HUANG J,ZHANG H,et al.Visual Commonsense R-CNN[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:10760-10770.
[16]ANDERSON P,HE X,BUEHLER C,et al.Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6077-6086.
[17]KRISHNA R,ZHU Y,GROTH O,et al.Visual Genome:Connecting Language and Vision Using Crowdsourced Dense Image Annotations[J].International Journal of Computer Vision,2017,123(1):32-73.
[18]REN S,HE K,GIRSHICK R,et al.Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks[C]//Advances in Neural Information Processing Systems:Vol.28.Curran Associates,Inc.,2015.
[19]TENEY D,ANDERSON P,HE X,et al.Tips and Tricks for Visual Question Answering:Learnings From the 2017 Challenge[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:4223-4232.
[20]PENNINGTON J,SOCHER R,MANNING C.GloVe:Global Vectors for Word Representation[C]//MOSCHITTI A,PANG B,DAELEMANS W.Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP).Doha,Qatar:Association for Computational Linguistics,2014:1532-1543.
[21]KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[M/OL].arXiv,2017[2024-03-31].http://arxiv.org/abs/1412.6980.
[22]YU Z,YU J,XIANG C,et al.Beyond Bilinear:Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering[J].IEEE Transactions on Neural Networks and Learning Systems,2018,29(12):5947-5959.
[23]BEN-YOUNES H,CADENE R,THOME N,et al.BLOCK:Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection[J].Proceedings of the AAAI Conference on Artificial Intelligence,2019,33(1):8102-8109.
[24]NGUYEN D K,OKATANI T.Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6087-6096.
[25]KIM J H,JUN J,ZHANG B T.Bilinear Attention Networks[C]//Advances in Neural Information Processing Systems:Vol.31.Curran Associates,Inc.,2018.
[26]GAO P,JIANG Z,YOU H,et al.Dynamic Fusion With Intra-and Inter-Modality Attention Flow for Visual Question Answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:6639-6648.
[27]LIU Y,ZHANG X,ZHANG Q,et al.Dual self-attention with co-attention networks for visual question answering[J].Pattern Recognition,2021,117:107956.
[28]KIM J J,LEE D G,WU J,et al.Visual question answering based on local-scene-aware referring expression generation[J].Neural Networks,2021,139:158-167.
[29]SHUANG K,GUO J,WANG Z.Comprehensive-perception dynamic reasoning for visual question answering[J].Pattern Recognition,2022,131:108878.
[30]GUO Z,HAN D.Sparse co-attention visual question answering networks based on thresholds[J].Applied Intelligence,2023,53(1):586-600.