Computer Science ›› 2023, Vol. 50 ›› Issue (1): 166-175.doi: 10.11896/jsjkx.211100237

• Artificial Intelligence •

Knowledge-based Visual Question Answering: A Survey

WANG Ruiping1,2, WU Shihong2, ZHANG Meihang3, WANG Xiaoping1   

  1 School of Artificial Intelligence and Automation,Huazhong University of Science and Technology,Wuhan 430074,China
    2 Research Institute of Yuanguang,YGSOFT INC.,Zhuhai,Guangdong 519085,China
    3 School of Mechanical Automation,Wuhan University of Science and Technology,Wuhan 430081,China
  • Received:2021-11-23 Revised:2022-06-13 Online:2023-01-15 Published:2023-01-09
• About author:WANG Ruiping,born in 1986,Ph.D,is a member of IEEE.His main research interests include computer vision,image processing,NLP and visual question answering.
  • Supported by:
    National Natural Science Foundation of China(51975432).

Abstract: As an important embodiment of artificial intelligence completeness and the visual Turing test, visual question answering (VQA), coupled with its potential application value, has received extensive attention from the computer vision and natural language processing communities. Knowledge plays an important role in visual question answering: especially when dealing with complex and open questions, reasoning knowledge and external knowledge are critical to obtaining correct answers. A question-answering mechanism that incorporates knowledge is called knowledge-based visual question answering (Kb-VQA). To date, no systematic survey of Kb-VQA has been published, and a study of how knowledge participates in VQA and the forms in which it is expressed can effectively fill this gap in the literature. In this paper, the constituent units of Kb-VQA are investigated, the forms in which knowledge exists are studied, and the concept of knowledge stratification is proposed. Further, the knowledge participation methods and expression forms in visual feature extraction, language feature extraction and multi-modal fusion are summarized, and future development trends and research directions are discussed.
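To make the pipeline sketched in the abstract concrete, the following is a minimal, illustrative toy: visual feature extraction (here, a stand-in for an object detector), language feature extraction (a stand-in for a question encoder), and multi-modal fusion in which a retrieved external-knowledge fact bridges the visual concept and the question. All function names, the `EXTERNAL_KB` dictionary and its facts are hypothetical stand-ins for the detectors, encoders and knowledge bases surveyed in the paper, not any specific model it covers.

```python
# Toy Kb-VQA sketch: the answer is produced not from pixels alone but by
# joining a detected visual concept with a fact from an external knowledge base.

EXTERNAL_KB = {  # external knowledge as (subject, relation) -> object triples
    ("banana", "is_a"): "fruit",
    ("banana", "color"): "yellow",
}

def extract_visual_concepts(image):
    # Stand-in for visual feature extraction, e.g. an object detector
    # returning region labels instead of real feature vectors.
    return image["objects"]

def extract_question_focus(question):
    # Stand-in for language feature extraction: reduce the question
    # to the relation it queries.
    return "color" if "color" in question else "is_a"

def answer(image, question):
    # Multi-modal fusion: pair each detected concept with the queried
    # relation and look the pair up in the external knowledge base.
    relation = extract_question_focus(question)
    for concept in extract_visual_concepts(image):
        fact = EXTERNAL_KB.get((concept, relation))
        if fact is not None:
            return fact
    return "unknown"

image = {"objects": ["table", "banana"]}
print(answer(image, "What color is the fruit on the table?"))  # -> yellow
```

A question like "What color is the fruit on the table?" cannot be answered from detection alone: the link from "banana" to "fruit" and to "yellow" lives in the knowledge base, which is exactly the role external knowledge plays in Kb-VQA.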

Key words: Visual question answering, Knowledge stratification, Internal logical reasoning, External knowledge base, Knowledge expression form, Knowledge participation method

CLC Number: TP391