Computer Science ›› 2021, Vol. 48 ›› Issue (3): 87-96. DOI: 10.11896/jsjkx.201200174
Special Issue: Advances on Multimedia Technology
NIU Yu-lei, ZHANG Han-wang