Computer Science, 2021, Vol. 48, Issue (3): 71-78. doi: 10.11896/jsjkx.201100176

Special Topic: Advances in Multimedia Technology


  • Corresponding author: HAN Ya-hong (yahong@tju.edu.cn)
  • About author: tjwam@tju.edu.cn

Survey of Cross-media Question Answering and Reasoning Based on Vision and Language

WU A-ming, JIANG Pin, HAN Ya-hong   

  1. College of Intelligence and Computing,Tianjin University,Tianjin 300350,China
  • Received:2020-10-25 Revised:2021-01-01 Online:2021-03-15 Published:2021-03-05
  • About author:WU A-ming,born in 1987,Ph.D.His main research interests include multimedia analysis and machine learning.
    HAN Ya-hong,born in 1977,Ph.D,professor.His main research interests include multimedia analysis,computer vision and machine learning.
  • Supported by:
    National Natural Science Foundation of China Key Program (61932009):Research on Key Theories and Methods of Cross-media Intelligent Question Answering and Reasoning(2020/01-2024/12).


Abstract: Cross-media question answering and reasoning based on vision and language is one of the research hotspots of artificial intelligence. It aims to return a correct answer based on understanding of the given visual content and related questions. With the rapid development of deep learning and its wide application in computer vision and natural language processing, cross-media question answering and reasoning based on vision and language has also developed rapidly. This paper systematically surveys current research on cross-media question answering and reasoning based on vision and language, and specifically introduces the research progress of image-based visual question answering and reasoning, video-based visual question answering and reasoning, and visual commonsense reasoning. In particular, image-based visual question answering and reasoning is subdivided into three categories, i.e., multi-modal fusion, attention mechanism, and reasoning based methods. Meanwhile, visual commonsense reasoning is subdivided into reasoning and pre-training based methods. Moreover, this paper summarizes the commonly used datasets for question answering and reasoning, as well as the experimental results of representative methods on these datasets. Finally, this paper looks forward to the future development directions of cross-media question answering and reasoning based on vision and language.
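To make the "attention mechanism" and "multi-modal fusion" categories of the surveyed taxonomy concrete, the following is a minimal NumPy sketch of a typical VQA baseline pattern — question-guided soft attention over image-region features followed by element-wise (Hadamard) fusion. It is an illustrative toy (random features, hypothetical function names), not the implementation of any specific method reviewed in the survey.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_fuse(region_feats, question_feat):
    """Question-guided soft attention over image regions,
    then element-wise (Hadamard) fusion of the two modalities."""
    # Attention score for each region: dot product with the question vector.
    scores = region_feats @ question_feat        # shape: (num_regions,)
    weights = softmax(scores)                    # attention distribution, sums to 1
    attended = weights @ region_feats            # attention-weighted visual feature, (dim,)
    return attended * question_feat              # Hadamard fusion, (dim,)

rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))   # e.g. 36 region features, as in bottom-up attention
question = rng.normal(size=512)        # pooled question embedding
fused = attend_and_fuse(regions, question)
print(fused.shape)  # prints (512,)
```

In real systems the region features come from a detector such as Faster R-CNN, the question embedding from an RNN or Transformer encoder, and the fused vector is fed to an answer classifier; richer fusions (bilinear pooling, co-attention) replace the Hadamard product.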

Key words: Attention mechanism, Cross-media question answering and reasoning, Image-based question answering and reasoning, Multi-modal fusion, Pre-training, Video-based question answering and reasoning, Visual commonsense question answering and reasoning

CLC number: TP391