Computer Science, 2021, Vol. 48, Issue (3): 79-86. doi: 10.11896/jsjkx.210200086
Special topic: Advances in Multimedia Technology
WANG Shu-hui (王树徽), YAN Xu (闫旭), HUANG Qing-ming (黄庆明)
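Abstract: Cross-media data, typified by web data, is growing explosively and exhibits complex cross-modal, cross-source correlations and dynamic evolution. Cross-media analysis and reasoning technology addresses the needs of multimodal information understanding, interaction, and content management. By building semantic bridging and unified representation mechanisms across modalities and platforms, it supports analysis and reasoning, progressively approximates complex cognitive goals, establishes logical reasoning mechanisms at the semantic level, and ultimately aims to achieve human-like intelligent reasoning over cross-media data. This paper reviews the research background and development history of cross-media analysis and reasoning, summarizes the key techniques for tasks such as vision-language association, and gives examples of research applications. Based on existing findings, it analyzes the key problems currently facing the field of cross-media analysis and finally discusses future development trends.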