Computer Science, 2021, Vol. 48, Issue (3): 87-96. doi: 10.11896/jsjkx.201200174
Special Topic: Advances in Multimedia Technology
牛玉磊, 张含望
NIU Yu-lei, ZHANG Han-wang
Abstract: Visual question answering and visual dialog are important research tasks in artificial intelligence and representative problems at the intersection of computer vision and natural language processing. Both tasks require a machine to answer single-round or multi-round natural-language questions about the content of a given image. They place high demands on a machine's perception, cognition, and reasoning abilities, and hold practical promise for cross-modal human-computer interaction. This paper surveys recent research progress on visual question answering and visual dialog, organizes the relevant datasets and algorithms, summarizes the open research challenges, and finally discusses future development trends for the field.
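The task setup described in the abstract — answering a natural-language question grounded in a given image — is, in many models from this literature, implemented by scoring image region features against an encoded question vector and pooling them into an attended visual feature. The following is a minimal illustrative sketch of that question-guided attention step; all dimensions, names, and the random features are assumptions for illustration, not details taken from the surveyed paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def question_guided_attention(region_feats, question_feat):
    """Score each image region against the question vector,
    then pool the regions into a single attended visual feature.

    region_feats:  (num_regions, feat_dim) image region features
    question_feat: (feat_dim,) encoded question vector
    """
    scores = region_feats @ question_feat   # (num_regions,) relevance scores
    weights = softmax(scores)               # attention distribution over regions
    attended = weights @ region_feats       # (feat_dim,) attended visual feature
    return attended, weights

# Toy inputs: e.g. 36 detected regions with 512-d features (common in
# bottom-up-attention pipelines), and a 512-d question encoding.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))
question = rng.normal(size=512)
visual, attn = question_guided_attention(regions, question)
```

In a full model, the attended visual feature would then be fused with the question representation (e.g. by elementwise product or bilinear pooling) and fed to an answer classifier; this sketch only shows the attention-and-pool step.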