Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230300212-5. doi: 10.11896/jsjkx.230300212

• Image Processing & Multimedia Technology •


Modality Fusion Strategy Research Based on Multimodal Video Classification Task

WANG Yifan, ZHANG Xuefang   

  1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070, China
  • Published: 2024-06-06
  • Corresponding author: ZHANG Xuefang (zhangxuefang@fhxy.net.cn)
  • About author: WANG Yifan, born in 1997, master. His main research interests include graph neural networks and multimodal machine learning. (wyf1519uir@163.com)
    ZHANG Xuefang, born in 1989, master, senior engineer. Her main research interests include intelligent optical networks and AI.
  • Supported by:
    National Key R&D Program of China(2019YFB1803600).


Abstract: Despite the success of AI-related technologies in many fields, they usually simulate only one type of human perception, which means they are limited to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is an important step in moving from weak/narrow AI toward strong/general AI. Based on an encoder-decoder architecture, this paper conducts a comparative study of multimodal fusion strategies on a video classification task: early fusion of the feature encodings of the modalities, late decision fusion of the prediction results of each modality, and a combination of the two. It also compares two ways of involving the audio modality in fusion: encoding the audio directly and fusing the resulting features, or converting the speech to text and letting the audio participate in fusion as a text modality. Experiments show that decision fusion of the individual prediction results of the text and audio modalities with the prediction from the fused features of the other two modalities further improves classification accuracy. Moreover, converting speech into text by ASR (automatic speech recognition) makes fuller use of the semantic information it contains.
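The fusion strategies compared above can be illustrated with a short sketch. The following is a minimal, hypothetical Python/PyTorch example, not the authors' implementation; the encoder outputs, feature dimensions, and class count are placeholders.

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    # Early fusion: concatenate per-modality feature encodings, classify jointly.
    def __init__(self, feature_dims, num_classes):
        super().__init__()
        self.classifier = nn.Linear(sum(feature_dims), num_classes)

    def forward(self, features):
        # features: list of (batch, dim_i) tensors, one per modality
        return self.classifier(torch.cat(features, dim=-1))

def late_fusion(per_modality_logits):
    # Late (decision) fusion: average per-modality class probabilities.
    probs = [torch.softmax(logits, dim=-1) for logits in per_modality_logits]
    return torch.stack(probs, dim=0).mean(dim=0)

# Hybrid strategy suggested by the experiments: predict from early-fused
# features, then decision-fuse that prediction with the standalone text
# and audio predictions. (The audio branch could instead be replaced by
# ASR output fed through the text encoder.)
video_feat = torch.randn(8, 512)    # e.g. output of a pretrained video encoder
image_feat = torch.randn(8, 256)    # placeholder encoding of a second modality
text_logits = torch.randn(8, 10)    # standalone text-branch prediction
audio_logits = torch.randn(8, 10)   # standalone audio-branch prediction

early = EarlyFusion([512, 256], num_classes=10)
fused_logits = early([video_feat, image_feat])
final_probs = late_fusion([fused_logits, text_logits, audio_logits])

Averaging probabilities is only one possible decision-fusion rule; weighted voting or a small meta-classifier over the stacked logits are common alternatives.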

Key words: Multimodality, Modality fusion, Speech recognition, Video classification

CLC number: TP181