Computer Science (计算机科学), 2023, Vol. 50, Issue (1): 166-175. doi: 10.11896/jsjkx.211100237

• Artificial Intelligence •

• Corresponding author: WANG Ruiping (ruiping.wang@ieee.org)

Knowledge-based Visual Question Answering: A Survey

WANG Ruiping1,2, WU Shihong2, ZHANG Meihang3, WANG Xiaoping1   

  1 School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
    2 Research Institute of Yuanguang, YGSOFT INC., Zhuhai, Guangdong 519085, China
    3 School of Mechanical Automation, Wuhan University of Science and Technology, Wuhan 430081, China
  • Received: 2021-11-23  Revised: 2022-06-13  Online: 2023-01-15  Published: 2023-01-09
  • About author: WANG Ruiping, born in 1986, Ph.D, is a member of IEEE. His main research interests include computer vision, image processing, NLP and visual question answering.
  • Supported by:
    National Natural Science Foundation of China (51975432).


Abstract: As an important embodiment of the completeness of artificial intelligence and the visual Turing test, visual question answering (VQA), together with its potential application value, has received extensive attention from both the computer vision and natural language processing communities. Knowledge plays an important role in visual question answering: especially when dealing with complex and open questions, reasoning knowledge and external knowledge are critical to obtaining correct answers. A question-answering mechanism that incorporates knowledge is called knowledge-based visual question answering (Kb-VQA). To date, no systematic survey of Kb-VQA has been conducted. Research on the ways knowledge participates in VQA and the forms in which it is expressed can therefore fill a gap in the literature on knowledge-based visual question answering. This paper surveys the constituent units of Kb-VQA, examines the forms in which knowledge exists, and proposes the concept of a knowledge hierarchy. Further, the knowledge participation methods and expression forms in visual feature extraction, language feature extraction and multi-modal fusion are summarized, and future development trends and research directions are discussed.
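The joint-embedding pipeline the abstract summarizes (visual feature extraction, language feature extraction, then multi-modal fusion and answer scoring) can be illustrated with a minimal sketch. This is not any surveyed model's implementation: the toy vectors below stand in for CNN image features and RNN question features, element-wise (Hadamard) fusion is only one of several fusion schemes covered by such surveys, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

def fuse(v_img, v_txt):
    """Element-wise (Hadamard) fusion of visual and language features."""
    return v_img * v_txt

def answer_scores(fused, answer_embeddings):
    """Dot-product scoring of the fused vector against candidate answers."""
    return answer_embeddings @ fused

# Toy 4-d features standing in for a CNN image encoder and an RNN question encoder.
v_img = np.array([1.0, 0.5, 0.0, 2.0])   # visual features
v_txt = np.array([0.0, 2.0, 1.0, 1.0])   # question features
answers = np.array([
    [1.0, 0.0, 0.0, 0.0],                # candidate answer A
    [0.0, 1.0, 0.0, 1.0],                # candidate answer B
])

fused = fuse(v_img, v_txt)               # -> [0.0, 1.0, 0.0, 2.0]
scores = answer_scores(fused, answers)   # -> [0.0, 3.0]
best = int(np.argmax(scores))            # answer B wins
```

Richer fusion operators (e.g. bilinear pooling or attention-weighted combinations) replace the element-wise product here, and knowledge-based models additionally inject retrieved facts before the scoring step.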

Key words: Visual question answering, Knowledge stratification, Internal logical reasoning, External knowledge base, Knowledge expression form, Knowledge participation method

CLC number: TP391