Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230300212-5.doi: 10.11896/jsjkx.230300212

• Image Processing & Multimedia Technology •

Modality Fusion Strategy Research Based on Multimodal Video Classification Task

WANG Yifan, ZHANG Xuefang   

  1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070, China
  • Published: 2024-06-06
  • About author: WANG Yifan, born in 1997, master. His main research interests include graph neural networks and multimodal machine learning.
    ZHANG Xuefang, born in 1989, master, senior engineer. Her main research interests include intelligent optical networks and AI.
  • Supported by:
    National Key R&D Program of China (2019YFB1803600).

Abstract: Although AI technologies have succeeded in many fields, they usually simulate only one type of human perception and are therefore limited to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is important for developing general AI. On a video classification task, this paper conducts a comparative study of multimodal fusion strategies built on an encoder-decoder architecture: early feature fusion, which fuses the encoded features of all modalities; late decision fusion, which fuses the prediction results of the individual modalities; and a combination of the two. It also compares two ways of involving the audio modality in fusion: encoding the audio signal directly into features, or first converting speech to text and fusing it in textual form. Experiments show that, under the setup of this study, applying decision fusion to the individual predictions of the text and audio modalities together with the prediction from the fused features of the other two modalities further improves classification accuracy. Moreover, converting speech into text with ASR (automatic speech recognition) makes fuller use of the semantic information it contains.
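The three strategies the abstract compares can be illustrated with a minimal NumPy sketch. All encoder weights, feature dimensions, and modality names below are illustrative assumptions, not details taken from the paper; the point is only the structural difference between fusing features before classification and fusing per-modality predictions after it.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over class logits
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions are illustrative).
feats = {
    "video": rng.normal(size=8),  # visual encoder output
    "audio": rng.normal(size=8),  # audio encoder (or ASR-text encoder) output
    "text":  rng.normal(size=8),  # title/subtitle text encoder output
}

num_classes = 5
# Toy linear classifier heads: one per modality, plus one for fused features.
W = {m: rng.normal(size=(8, num_classes)) for m in feats}
W["early"] = rng.normal(size=(24, num_classes))

# Early (feature-level) fusion: concatenate encodings, classify once.
fused = np.concatenate([feats["video"], feats["audio"], feats["text"]])
p_early = softmax(fused @ W["early"])

# Late (decision-level) fusion: classify each modality, average probabilities.
p_late = np.mean([softmax(x @ W[m]) for m, x in feats.items()], axis=0)

# Combined strategy: fuse the early-fusion prediction with the
# per-modality predictions at the decision level.
p_hybrid = (p_early + p_late) / 2
```

Since each component is a probability distribution over classes, any convex combination of them (such as `p_hybrid`) remains a valid distribution, which is what makes this kind of decision-level combination straightforward.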

Key words: Multimodality, Modality fusion, Speech recognition, Video classification

CLC Number: TP181