Computer Science ›› 2021, Vol. 48 ›› Issue (3): 71-78. doi: 10.11896/jsjkx.201100176

Special Issue: Advances on Multimedia Technology

Survey of Cross-media Question Answering and Reasoning Based on Vision and Language

WU A-ming, JIANG Pin, HAN Ya-hong   

  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  • Received: 2020-10-25 Revised: 2021-01-01 Online: 2021-03-15 Published: 2021-03-05
  • About author: WU A-ming, born in 1987, Ph.D. His main research interests include multimedia analysis and machine learning.
    HAN Ya-hong, born in 1977, Ph.D., professor. His main research interests include multimedia analysis, computer vision and machine learning.
  • Supported by:
    National Natural Science Foundation of China Key Program (61932009): Research on Key Theories and Methods of Cross-media Intelligent Question Answering and Reasoning (2020/01-2024/12).

Abstract: Cross-media question answering and reasoning based on vision and language is one of the research hotspots of artificial intelligence. It aims to return a correct answer based on an understanding of the given visual content and a related question. With the rapid development of deep learning and its wide application in computer vision and natural language processing, cross-media question answering and reasoning based on vision and language has also advanced rapidly. This paper systematically surveys current research on cross-media question answering and reasoning based on vision and language, and specifically reviews the research progress of image-based visual question answering and reasoning, video-based visual question answering and reasoning, and visual commonsense reasoning. In particular, image-based visual question answering and reasoning is subdivided into three categories of methods, i.e., multi-modal fusion, attention mechanism, and reasoning based methods, while visual commonsense reasoning is subdivided into reasoning based and pre-training based methods. Moreover, this paper summarizes the commonly used question answering and reasoning datasets, as well as the experimental results of representative methods. Finally, this paper discusses future development directions of cross-media question answering and reasoning based on vision and language.
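
To make the taxonomy above concrete, the following minimal Python/PyTorch sketch combines the two ingredients that recur throughout the surveyed image-based methods: question-guided attention over image regions and multi-modal fusion of the attended visual feature with the question feature. It is a toy illustration, not the implementation of any surveyed model; the class name ToyVQA, all sizes (36 regions, 2048-d region features, a 3000-answer vocabulary) and the element-wise fusion are illustrative assumptions.

# Hypothetical toy model, for illustration only (names and sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=3000,
                 word_dim=300, hidden_dim=512, region_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.v_proj = nn.Linear(region_dim, hidden_dim)  # project region features
        self.att = nn.Linear(hidden_dim, 1)              # question-guided attention scorer
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, regions, question):
        # regions: (B, K, region_dim) pre-extracted image region features
        # question: (B, T) word-index sequence
        _, (h, _) = self.lstm(self.embed(question))
        q = h[-1]                               # (B, hidden_dim) question vector
        v = torch.tanh(self.v_proj(regions))    # (B, K, hidden_dim)
        scores = self.att(v * q.unsqueeze(1))   # (B, K, 1) attention logits
        alpha = F.softmax(scores, dim=1)        # attention weights over the K regions
        v_att = (alpha * v).sum(dim=1)          # (B, hidden_dim) attended visual feature
        fused = v_att * q                       # element-wise multi-modal fusion
        return self.classifier(fused)           # (B, num_answers) answer logits

# Usage with random stand-in inputs: 2 images with 36 regions each, 12-word questions.
model = ToyVQA()
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.argmax(dim=-1))

In the surveyed literature, the element-wise product is typically replaced by richer bilinear pooling operators, and the single attention step by stacked or co-attention variants; reasoning based methods further add modular or graph-structured computation on top of such a backbone.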

Key words: Attention mechanism, Cross-media question answering and reasoning, Image-based question answering and reasoning, Multi-modal fusion, Pre-training, Video-based question answering and reasoning, Visual commonsense question answering and reasoning

CLC Number: TP391