Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230300212-5.doi: 10.11896/jsjkx.230300212

• Image Processing & Multimedia Technology •

Modality Fusion Strategy Research Based on Multimodal Video Classification Task

WANG Yifan, ZHANG Xuefang   

  1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070, China
  • Published: 2024-06-06
  • About author: WANG Yifan, born in 1997, master. His main research interests include graph neural networks and multimodal machine learning.
    ZHANG Xuefang, born in 1989, master, senior engineer. Her main research interests include intelligent optical networks and AI.
  • Supported by:
    National Key R&D Program of China (2019YFB1803600).

Abstract: Although AI technologies have succeeded in many fields, they usually simulate only one type of human perception and are therefore limited to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is important for developing general AI. On a video classification task, this paper conducts a comparative study of multimodal fusion strategies built on an encoder-decoder architecture: early feature fusion, which fuses the encoded features of all modalities; late decision fusion, which fuses the prediction results of the individual modalities; and a combination of the two. It also compares two ways of involving the audio modality in fusion: encoding the audio signal directly into features, or first converting speech to text and fusing it in textual form. Experiments show that, under the setup of this study, applying decision fusion to the individual predictions of the text and audio modalities together with the prediction from the fused features of the other two modalities further improves classification accuracy. Moreover, converting speech into text with ASR (automatic speech recognition) makes fuller use of the semantic information it contains.
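The three strategies the abstract compares can be illustrated with a minimal NumPy sketch. All encoder weights, feature dimensions, and modality names below are illustrative assumptions, not details taken from the paper; the point is only the structural difference between fusing features before classification and fusing per-modality predictions after it.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over class logits
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions are illustrative).
feats = {
    "video": rng.normal(size=8),  # visual encoder output
    "audio": rng.normal(size=8),  # audio encoder (or ASR-text encoder) output
    "text":  rng.normal(size=8),  # title/subtitle text encoder output
}

num_classes = 5
# Toy linear classifier heads: one per modality, plus one for fused features.
W = {m: rng.normal(size=(8, num_classes)) for m in feats}
W["early"] = rng.normal(size=(24, num_classes))

# Early (feature-level) fusion: concatenate encodings, classify once.
fused = np.concatenate([feats["video"], feats["audio"], feats["text"]])
p_early = softmax(fused @ W["early"])

# Late (decision-level) fusion: classify each modality, average probabilities.
p_late = np.mean([softmax(x @ W[m]) for m, x in feats.items()], axis=0)

# Combined strategy: fuse the early-fusion prediction with the
# per-modality predictions at the decision level.
p_hybrid = (p_early + p_late) / 2
```

Since each component is a probability distribution over classes, any convex combination of them (such as `p_hybrid`) remains a valid distribution, which is what makes this kind of decision-level combination straightforward.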

Key words: Multimodality, Modality fusion, Speech recognition, Video classification

CLC Number: TP181