Computer Science (计算机科学), 2021, Vol. 48, Issue (3): 79-86. doi: 10.11896/jsjkx.210200086

Special Topic: Advances in Multimedia Technology


• Corresponding author: HUANG Qing-ming (qmhuang@ucas.ac.cn)
• First author's e-mail: wangshuhui@ict.ac.cn

Overview of Research on Cross-media Analysis and Reasoning Technology

WANG Shu-hui, YAN Xu, HUANG Qing-ming   

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2021-01-20 Revised: 2021-02-09 Online: 2021-03-15 Published: 2021-03-05
  • About author: WANG Shu-hui, born in 1983, Ph.D, professor, Ph.D supervisor. His main research interests include cross-media understanding, multi-modal learning/reasoning and large-scale Web multimedia data mining.
    HUANG Qing-ming, born in 1965, Ph.D, professor, Ph.D supervisor. His main research interests include multimedia computing, image/video processing, pattern recognition and computer vision.
  • Supported by:
    National Key R&D Program of China (2018AAA0102003), National Natural Science Foundation of China (62022083, 61672497) and Key Research Program of Frontier Sciences of CAS (QYZDJ-SSW-SYS013).

Abstract: Cross-media data, typified by Web data, is currently growing explosively, exhibiting complex cross-modal, cross-data-source correlations and dynamic evolution. Targeting the needs of multimodal information understanding, interaction and content management, cross-media analysis and reasoning technology builds cross-modal, cross-platform mechanisms for semantic alignment and unified representation, carries out analysis and reasoning that progressively approaches complex cognitive goals, establishes semantic-level logical reasoning mechanisms, and ultimately aims at human-like cross-media reasoning. This paper reviews the research background and development history of cross-media analysis and reasoning technology, summarizes the key techniques of tasks such as vision-language correlation learning, and illustrates representative research applications. Based on existing results, it analyzes the key problems currently facing the cross-media analysis field, and finally discusses future development trends.

Key words: Cross-media analysis and reasoning, Deep learning, Multi-modal fusion, Visual-and-language analysis
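To make the unified-representation idea in the abstract concrete, below is a minimal sketch of classical canonical correlation analysis (CCA), one of the earliest cross-modal correlation-learning tools: it projects two feature "modalities" (e.g., image and text descriptors) into a shared space where their correlation is maximal. The toy data, feature dimensions, and function names here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Classical CCA: find k pairs of projections that maximize the
    correlation between two views (e.g., image and text features)."""
    # Center both views
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view covariances and the cross-covariance
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whitening transforms: (X @ Wx) has identity covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    # SVD of the whitened cross-covariance yields canonical directions;
    # singular values are the canonical correlations (descending)
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy, full_matrices=False)
    A = Wx @ U[:, :k]      # projection matrix for view X
    B = Wy @ Vt.T[:, :k]   # projection matrix for view Y
    return A, B, s[:k]

# Toy "cross-media" data: two views generated from a shared latent factor
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))                              # shared semantics
X = z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))  # "image" view
Y = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))  # "text" view
A, B, corrs = cca(X, Y, k=2)
```

Because both views share a latent factor, the leading canonical correlations come out close to 1; real cross-media systems (deep CCA, joint embeddings) replace the linear projections with learned networks but keep the same shared-space objective.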

CLC Number: TP181