Computer Science ›› 2021, Vol. 48 ›› Issue (3): 79-86. doi: 10.11896/jsjkx.210200086

• Advances in Multimedia Technology •


Overview of Research on Cross-media Analysis and Reasoning Technology

WANG Shu-hui, YAN Xu, HUANG Qing-ming   

  1. Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2021-01-20 Revised:2021-02-09 Online:2021-03-15 Published:2021-03-05
  • Corresponding author: HUANG Qing-ming(qmhuang@ucas.ac.cn)
  • About author:WANG Shu-hui(wangshuhui@ict.ac.cn),born in 1983,Ph.D,professor,Ph.D supervisor.His main research interests include cross-media understanding,multi-modal learning/reasoning and large-scale Web multimedia data mining.
    HUANG Qing-ming,born in 1965,Ph.D,professor,Ph.D supervisor.His main research interests include multimedia computing,image/video processing,pattern recognition and computer vision.
  • Supported by:
    National Key R&D Program of China(2018AAA0102003),National Natural Science Foundation of China(62022083,61672497) and Key Research Program of Frontier Sciences of CAS (QYZDJ-SSW-SYS013).


Abstract: Cross-media data,typified by Web data,is growing explosively and exhibits complex correlation and dynamic evolution across modalities and data sources.Cross-media analysis and reasoning technology addresses the needs of multimodal information understanding,interaction and content management.By constructing cross-modal,cross-platform mechanisms for semantic alignment and unified representation,it enables further analysis and reasoning,progressively approximates complex cognitive goals,and establishes logical reasoning mechanisms at the semantic level,ultimately aiming at human-like cross-media reasoning intelligence.This paper reviews the research background and development history of cross-media analysis and reasoning technology,summarizes the key technologies of vision-language correlation and related tasks,and illustrates typical research applications.Based on the existing results,this paper analyzes the key problems currently facing the field of cross-media analysis,and finally discusses the future development trend.

Key words: Cross-media analysis and reasoning, Deep learning, Multi-modal fusion, Visual-and-language analysis

CLC Number: TP181
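
The unified cross-modal representation that the abstract describes is often realized as a joint visual-semantic embedding learned with a ranking objective, in the spirit of the vision-language correlation methods the survey covers. Below is a minimal, illustrative sketch (not the authors' implementation) in Python/PyTorch: two modality-specific projections map precomputed image and text features into a shared space, trained with a bidirectional max-margin loss over in-batch negatives. The feature dimensions, margin value and module names are assumptions chosen for the example.

```python
# Minimal sketch of a joint visual-semantic embedding (illustrative,
# not from the surveyed work). Image and text features are projected
# into a shared space where matched pairs are pulled together by a
# bidirectional max-margin ranking loss over in-batch negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedding(nn.Module):
    # Dimensions are assumptions: e.g., 2048-d CNN image features,
    # 768-d text-encoder features, a 512-d shared space.
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalize so the dot product equals cosine similarity.
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t


def ranking_loss(v, t, margin=0.2):
    """Bidirectional hinge loss; diagonal entries are matched pairs."""
    sim = v @ t.t()                                   # (B,B) similarities
    pos = sim.diag().unsqueeze(1)                     # matched-pair scores
    cost_i2t = (margin + sim - pos).clamp(min=0)      # image-to-text
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)  # text-to-image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_i2t.masked_fill(mask, 0).mean()
            + cost_t2i.masked_fill(mask, 0).mean())


if __name__ == "__main__":
    # Random tensors stand in for real encoder outputs.
    model = JointEmbedding()
    v, t = model(torch.randn(8, 2048), torch.randn(8, 768))
    print(ranking_loss(v, t).item())
```

At retrieval time, cosine similarity in the shared space ranks candidates in either direction, which is the usual protocol for cross-modal retrieval under this kind of unified representation.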