Computer Science, 2023, Vol. 50, Issue (5): 177-188. doi: 10.11896/jsjkx.220500124

• Artificial Intelligence •

Survey of Visual Question Answering Based on Deep Learning

LI Xiang1, FAN Zhiguang2, LI Xuexiang1, ZHANG Weixing1, YANG Cong1, CAO Yangjie1   

  1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, China
    2 Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450000, China
  • Received: 2022-05-16  Revised: 2022-09-05  Online: 2023-05-15  Published: 2023-05-06
  • Corresponding author: LI Xuexiang (lxx@zzu.edu.cn)
  • About author: LI Xiang, born in 1997, postgraduate (lixiang.zg@qq.com). His main research interests include visual question answering.
    LI Xuexiang, born in 1965, professor, master supervisor. His main research interests include high-performance computing and cloud computing.
  • Supported by:
    General Project of the National Natural Science Foundation of China (61972092) and Collaborative Innovation Major Project of Zhengzhou (20XTZX06013).

Abstract: Visual question answering (VQA) is an interdisciplinary research area at the intersection of computer vision and natural language processing. A VQA model typically encodes both the image and the question text, learns a mapping between the two modalities, fuses their features, and finally generates an appropriate answer. The task therefore tests a model's ability to understand images and to reason about answers. As an important route to cross-modal human-computer interaction with promising applications, VQA has recently attracted a number of emerging techniques, including scene-reasoning based methods, contrastive-learning based methods, and 3D-point-cloud based methods. While these methods achieve notable performance, they still suffer from insufficient reasoning ability and a lack of interpretability, which demand further exploration. This paper presents an in-depth survey and summary of related research and recent proposals in the field of VQA. The essential background of VQA is first introduced, followed by an analysis and summarization of state-of-the-art approaches and datasets. Finally, in view of the problems with current models, future research directions are discussed.
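
To make the pipeline described above concrete, the following is a minimal sketch of a joint-embedding VQA model in PyTorch: a CNN encodes the image, an LSTM encodes the question, the two feature vectors are fused by element-wise product, and a classifier scores a fixed answer vocabulary. All module names, dimensions, and the fusion choice are illustrative assumptions, not the design of any specific model covered by this survey.

    # Minimal VQA pipeline sketch (illustrative assumptions throughout).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class SimpleVQA(nn.Module):
        def __init__(self, vocab_size=10000, num_answers=3000, hidden=1024):
            super().__init__()
            # 1) Image encoder: a CNN backbone yields a global visual feature.
            backbone = models.resnet18(weights=None)
            backbone.fc = nn.Identity()          # keep the 512-d pooled feature
            self.cnn = backbone
            self.img_proj = nn.Linear(512, hidden)
            # 2) Question encoder: embed tokens, run an LSTM, keep final state.
            self.embed = nn.Embedding(vocab_size, 300)
            self.lstm = nn.LSTM(300, hidden, batch_first=True)
            # 3) Fusion + answer classifier over a fixed answer vocabulary.
            self.classifier = nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_answers),
            )

        def forward(self, image, question_tokens):
            v = torch.relu(self.img_proj(self.cnn(image)))      # (B, hidden)
            _, (h, _) = self.lstm(self.embed(question_tokens))  # h: (1, B, hidden)
            q = h.squeeze(0)                                    # (B, hidden)
            fused = v * q       # element-wise product fuses the two modalities
            return self.classifier(fused)   # logits over candidate answers

    # Usage: a batch of images and tokenized questions -> answer logits.
    model = SimpleVQA()
    logits = model(torch.randn(2, 3, 224, 224),
                   torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 3000])

Attention-based and Transformer-based models replace the single fused vector with learned alignments between question words and image regions, but the encode-fuse-answer structure sketched here is the common backbone.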

Key words: Visual question answering, Cross-modal, Human-computer interaction, Reasoning ability, Interpretability

CLC number: TP181