Computer Science (计算机科学), 2021, 48(3): 71-78. doi: 10.11896/jsjkx.201100176
Special Topic: Advances in Multimedia Technology
武阿明, 姜品, 韩亚洪
WU A-ming, JIANG Pin, HAN Ya-hong
Abstract: Cross-media question answering and reasoning based on vision and language is one of the research hotspots in artificial intelligence. Its goal is for a model to return the correct answer given visual content and a related question. With the rapid development of deep learning and its wide application in computer vision and natural language processing, vision-and-language cross-media question answering and reasoning has also advanced quickly. This paper first systematically reviews current work in this area, covering research progress on image-based visual question answering and reasoning, video-based visual question answering and reasoning, and visual commonsense reasoning models and algorithms. Image-based visual question answering and reasoning is further divided into three categories: multimodal-fusion-based, attention-based, and reasoning-based methods; visual commonsense reasoning is divided into two categories: reasoning-based and pre-training-based methods. The paper then summarizes the commonly used question answering and reasoning datasets, together with the experimental results of representative models on these datasets. Finally, it discusses future research directions for vision-and-language cross-media question answering and reasoning.
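To make the multimodal-fusion and attention categories in the taxonomy above concrete, the following is a minimal illustrative sketch (in PyTorch) of a common image-based VQA pattern: a question vector guides attention over image region features, and the attended visual vector is fused with the question vector by an element-wise product before answer classification. This is a hedged sketch only; all class names, layer choices, and dimensions are hypothetical and do not reproduce any specific model surveyed in this paper.

```python
# Illustrative sketch of question-guided attention + element-wise fusion for VQA.
# Hypothetical names and dimensions; not code from any surveyed model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusionVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, region_dim=2048, hidden_dim=512):
        super().__init__()
        # Question encoder: word embedding + GRU (stands in for the LSTM/GRU
        # encoders commonly used in VQA models).
        self.embed = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, hidden_dim, batch_first=True)
        # Project image region features (e.g., bottom-up detector features)
        # into the same space as the question vector.
        self.img_proj = nn.Linear(region_dim, hidden_dim)
        # Question-guided attention over image regions.
        self.att = nn.Linear(hidden_dim, 1)
        # Answer classifier over the fused representation.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, region_feats):
        # question_tokens: (B, T) token ids; region_feats: (B, K, region_dim)
        _, q = self.gru(self.embed(question_tokens))       # q: (1, B, H)
        q = q.squeeze(0)                                    # (B, H)
        v = torch.relu(self.img_proj(region_feats))         # (B, K, H)
        # Attention weights: relevance of each region to the question.
        scores = self.att(torch.tanh(v * q.unsqueeze(1)))   # (B, K, 1)
        alpha = F.softmax(scores, dim=1)
        v_att = (alpha * v).sum(dim=1)                       # (B, H) attended visual vector
        # Simple multimodal fusion via element-wise (Hadamard) product.
        fused = v_att * q
        return self.classifier(fused)                        # answer logits
```

In practice, the surveyed models replace the element-wise product with richer fusion operators (e.g., bilinear pooling) and stack multiple attention or co-attention layers; this sketch only shows the basic structure shared by that family of methods.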
CLC Number: