Computer Science ›› 2021, Vol. 48 ›› Issue (3): 79-86.doi: 10.11896/jsjkx.210200086

Special Issue: Advances on Multimedia Technology


Overview of Research on Cross-media Analysis and Reasoning Technology

WANG Shu-hui, YAN Xu, HUANG Qing-ming   

  1. Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2021-01-20 Revised:2021-02-09 Online:2021-03-15 Published:2021-03-05
  • About author:WANG Shu-hui,born in 1983,Ph.D,professor,Ph.D supervisor.His main research interests include cross-media understanding,multi-modal learning/reasoning and large-scale Web multimedia data mining.
HUANG Qing-ming,born in 1965,Ph.D,professor,Ph.D supervisor.His main research interests include multimedia computing,image/video processing,pattern recognition and computer vision.
  • Supported by:
    National Key R&D Program of China(2018AAA0102003),National Natural Science Foundation of China(62022083,61672497) and Key Research Program of Frontier Sciences of CAS (QYZDJ-SSW-SYS013).

Abstract: Cross-media data exhibit complex correlation characteristics across modalities and data sources. Cross-media analysis and reasoning technology targets multimodal information understanding and interaction tasks. By constructing cross-modal and cross-platform semantic transformation mechanisms, and through further question-and-answer interaction, it progressively approaches complex cognitive goals and models the high-level logical reasoning process over cross-modal information, ultimately realizing multimodal artificial intelligence. This paper reviews the research background and development history of cross-media analysis and reasoning technology, and summarizes the key technologies of cross-modal tasks involving vision and language. Building on existing research, it analyzes the open problems in the field of multimedia analysis and, finally, discusses future development trends.
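To make the notion of a cross-modal semantic transformation mechanism concrete, the following is a minimal, illustrative sketch of the common-space embedding recipe behind image-text matching methods such as [9] and [12]: two projection towers map pre-extracted image and text features into a shared space and are trained with a bidirectional triplet ranking loss. The sketch is written in PyTorch; all class and function names, feature dimensions, and the margin value are assumptions for illustration, not any surveyed paper's exact model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEmbedding(nn.Module):
    """Hypothetical two-tower model: projects pre-extracted image and
    text features into one shared embedding space (sketch only)."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # image tower
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # text tower

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that a dot product equals cosine similarity
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    """Bidirectional hinge loss over all in-batch negatives."""
    sim = v @ t.t()                  # (B, B) image-text cosine similarities
    pos = sim.diag().view(-1, 1)     # similarities of matched pairs
    # image -> text: off-diagonal captions in each row are negatives
    cost_i2t = (margin + sim - pos).clamp(min=0)
    # text -> image: off-diagonal images in each column are negatives
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_i2t.masked_fill(mask, 0).mean()
            + cost_t2i.masked_fill(mask, 0).mean())

# Usage with random stand-in features (batch of 32 image-caption pairs):
model = TwoTowerEmbedding()
v, t = model(torch.randn(32, 2048), torch.randn(32, 768))
triplet_ranking_loss(v, t).backward()

Once trained, cross-modal retrieval reduces to nearest-neighbor search on cosine similarity in the shared space; the pre-training models surveyed in [55-60] pursue the same cross-modal alignment idea but replace the linear towers with Transformer encoders.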

Key words: Cross-media analysis and reasoning, Deep learning, Multi-modal fusion, Visual-and-language analysis

CLC Number: 

  • TP181
[1]SRIVASTAVA N,SALAKHUTDINOV R.Multimodal learning with deep Boltzmann machines[J].The Journal of Machine Learning Research,2014,15(1):2949-2980.
[2]ATREY P K,HOSSAIN M A,SADDIK A E,et al.Multimodal fusion for multimedia analysis:a survey[J].Multimedia Systems,2010,16(6):345-379.
[3]LONG J,SHELHAMER E,DARRELL T.Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3431-3440.
[4]HOTELLING H.Relations Between Two Sets of Variates[J].Biometrika,1936,28(3/4):321-377.
[5]SHAWE-TAYLOR J,CRISTIANINI N.Kernel Methods for Pattern Analysis[M].Cambridge:Cambridge University Press,2004.
[6]SHARMA A,KUMAR A,DAUME H,et al.Generalized multiview analysis:A discriminative latent space[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2012:2160-2167.
[7]SONG G L,WANG S H,HUANG Q M,et al.Multimodal Similarity Gaussian Process Latent Variable Model[J].IEEE Transactions on Image Processing,2017,26(9):4168-4181.
[8]YAN H,WANG S,LIU S,et al.Cross-modal correlation learning by adaptive hierarchical semantic aggregation[J].IEEE Transactions on Multimedia,2016,18(6):1201-1216.
[9]WANG L,LI Y,LAZEBNIK S.Learning deep structure-preserving image-text embeddings[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:5005-5013.
[10]WANG L,LI Y,LAZEBNIK S.Learning a recurrent residual fusion network for multimodal matching[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:4107-4116.
[11]ANDREW G,ARORA R,BILMES J,et al.Deep canonical correlation analysis[C]//International Conference on Machine Learning.2013:1247-1255.
[12]WU Y L,WANG S H,HUANG Q M.Online asymmetric similarity learning for cross-modal retrieval[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:4269-4278.
[13]KARPATHY A,FEI-FEI L.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3128-3137.
[14]MA L,LU Z,SHANG L.Multimodal convolutional neural networks for matching image and sentence[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2623-2631.
[15]HUANG Y,WU Q,WANG W,et al.Image and sentence matching via semantic concepts and order learning[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020,42(3):636-650.
[16]WANG S H,CHEN Y Y,ZUO J B,et al.Joint global and co-attentive representation learning for image-sentence retrieval[C]//Proceedings of the 26th ACM International Conference on Multimedia.2018:1398-1406.
[17]WU Y,WANG S,SONG G,et al.Augmented Adversarial Training for Cross-modal Retrieval[J].IEEE Transactions on Multimedia,2021,23:559-571.
[18]VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3156-3164.
[19]VENUGOPALAN S,XU H,DONAHUE J,et al.Translating Videos to Natural Language Using Deep Recurrent Neural Networks[J].arXiv:1412.4729,2015.
[20]YAO L,TORABI A,CHO K,et al.Describing videos by exploiting temporal structure[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:4507-4515.
[21]CORNIA M,BARALDI L,CUCCHIARA R.Show,control and tell:A framework for generating controllable and grounded captions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:8307-8316.
[22]YIN G,SHENG L,LIU B,et al.Context and attribute grounded dense captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:6241-6250.
[23]ZHENG Y,LI Y,WANG S.Intention oriented image captions with guiding objects[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:8395-8404.
[24]KRISHNA R,HATA K,REN F,et al.Dense-captioning events in videos[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:706-715.
[25]QI Z B,WANG S H,SU C.Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:3798-3806.
[26]ZHOU L,ZHOU Y,CORSO J,et al.End-to-End Dense Video Captioning with Masked Transformer[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:8739-8748.
[27]MUN J,YANG L,REN Z,et al.Streamlined dense video captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:6588-6597.
[28]YU L,ZHANG W,WANG J,et al.SeqGAN:Sequence generative adversarial nets with policy gradient[C]//Thirty-First AAAI Conference on Artificial Intelligence.2017:2852-2858.
[29]CHEN Y,WANG S,ZHANG W,et al.Less is more:Picking informative frames for video captioning[C]//European Conference on Computer Vision.2018:358-373.
[30]GUO L,LIU J,YAO P,et al.MSCap:Multi-style image captioning with unpaired stylized text[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:4204-4213.
[31]SHUSTER K,HUMEAU S,HU H,et al.Engaging image captioning via personality[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:12516-12526.
[32]XU Y,WU B,SHEN F,et al.Exact adversarial attack to image captioning via structured output learning with latent variables[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:4135-4144.
[33]DOGNIN P,MELNYK I,MROUEH Y,et al.Adversarial semantic alignment for improved image captions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:10463-10471.
[34]REED S,AKATA Z,YAN X,et al.Generative adversarial text to image synthesis[C]//International Conference on Machine Learning.2016:1060-1069.
[35]REED S,AKATA Z,MOHAN S,et al.Learning what and where to draw[C]//Advances in Neural Information Processing Systems.2016:217-225.
[36]ZHANG H,XU T,LI H,et al.StackGAN:Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:5908-5916.
[37]ZHANG H,XU T,LI H,et al.StackGAN++:Realistic Image Synthesis with Stacked Generative Adversarial Networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,41(8):1947-1962.
[38]XU T,ZHANG P,HUANG Q,et al.AttnGAN:Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:1316-1324.
[39]JOHNSON J,HARIHARAN B,VAN DER MAATEN L,et al.CLEVR:A diagnostic dataset for compositional language and elementary visual reasoning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:2901-2910.
[40]ANTOL S,AGRAWAL A,LU J,et al.VQA:Visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:2425-2433.
[41]WU Q,WANG P,SHEN C,et al.Are you talking to me?reasoned visual dialog generation through adversarial learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6106-6115.
[42]KIM J H,ON K W,LIM W,et al.Hadamard product for low-rank bilinear pooling[C]//International Conference on Learning Representations.2017:1-13.
[43]YU Z,YU J,XIANG C,et al.Beyond Bilinear:Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering[J].IEEE Transactions on Neural Networks and Learning Systems,2018,29(12):5947-5959.
[44]HAN X,WANG S,SU C,et al.Interpretable Visual Reasoning via Probabilistic Formulation Under Natural Supervision[C]//European Conference on Computer Vision.2020:553-570.
[45]WANG P,WU Q,SHEN C,et al.Explicit Knowledge-based Reasoning for Visual Question Answering[J].arXiv:1511.02570,2015.
[46]ANDERSON P,WU Q,TENEY D,et al.Image captioning and visual question answering based on attributes and external knowledge[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,40(6):1367-1381.
[47]NARASIMHAN M,LAZEBNIK S,SCHWING A.Out of the box:Reasoning with graph convolution nets for factual visual question answering[C]//Advances in Neural Information Processing Systems.2018:2654-2665.
[48]ANDERSON P,WU Q,TENEY D,et al.Vision-and-language navigation:Interpreting visually-grounded navigation instructions in real environments[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:3674-3683.
[49]WANG X,XIONG W,WANG H,et al.Look before you leap:Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation[C]//Proceedings of the European Conference on Computer Vision.2018:37-53.
[50]FRIED D,HU R,CIRIK V,et al.Speaker-follower models for vision-and-language navigation[C]//Advances in Neural Information Processing Systems.2018:3314-3325.
[51]WANG X,HUANG Q,CELIKYILMAZ A,et al.Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2019:6629-6638.
[52]TAN H,YU L,BANSAL M.Learning to navigate unseen environments:Back translation with environmental dropout[C]//International Conference on Learning Representations.2019.
[53]MA C Y,LU J,WU Z,et al.Self-monitoring navigation agent via auxiliary progress estimation[C]//International Conference on Learning Representations.2019.
[54]ZHU F,ZHU Y,CHANG X,et al.Vision-language navigation with self-supervised auxiliary reasoning tasks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2020:10012-10022.
[55]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[56]SUN C,MYERS A,VONDRICK C,et al.VideoBERT:A joint model for video and language representation learning[C]//Proceedings of the IEEE International Conference on Computer Vision.2019:7464-7473.
[57]LI L H,YATSKAR M,YIN D,et al.VisualBERT:A simple and performant baseline for vision and language[J].arXiv:1908.03557,2019.
[58]SU W,ZHU X,CAO Y,et al.VL-BERT:Pre-training of generic visual-linguistic representations[J].arXiv:1908.08530,2019.
[59]LU J,BATRA D,PARIKH D,et al.ViLBERT:Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//Advances in Neural Information Processing Systems.2019:13-23.
[60]TAN H,BANSAL M.LXMERT:Learning cross-modality encoder representations from transformers[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.2019.
[61]HOOD B M,ATKINSON J.Disengaging visual attention in the infant and adult[J].Infant Behavior & Development,1993,16(4):405-422.
[62]LIU X J,LI L,WANG S H,et al.Adaptive reconstruction network for weakly supervised referring expression grounding[C]//Proceedings of the IEEE International Conference on Computer Vision.2019:2611-2620.
[63]LIU C X,MAO J H,SHA F,et al.Attention correctness in neural image captioning[C]//Proceedings of the Conference on Artificial Intelligence.2017:4176-4182.
[64]JI S,PAN S,CAMBRIA E,et al.A Survey on Knowledge Graphs:Representation,Acquisition and Applications[C]//Proceedings of the Conference on Artificial Intelligence.2020.
[65]MALINOWSKI M,FRITZ M.A multi-world approach to question answering about real-world scenes based on uncertain input[C]//Advances in Neural Information Processing Systems.2014:1682-1690.
[66]WU Q,WANG P,SHEN C,et al.Ask me anything:Free-form visual question answering based on knowledge from external sources[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:4622-4630.