Computer Science, 2024, Vol. 51, Issue (6A): 230300212-5. doi: 10.11896/jsjkx.230300212
WANG Yifan, ZHANG Xuefang
Abstract: Although AI techniques have achieved success in many fields, they typically simulate only a single human perceptual ability, which means they are restricted to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is of great significance for advancing from weak/narrow AI toward strong/general AI. Based on an encoder-decoder architecture, this study compares different multimodal fusion strategies on a video classification task: early fusion of the encoded features of each modality, late decision fusion of the per-modality predictions, and a combination of the two. It also compares two ways for the audio modality to participate in fusion: encoding the audio directly as features, or converting speech to text so that the audio participates in fusion in textual form. Experimental results show that decision-level fusion of the standalone text and audio predictions with the prediction obtained from the fused features of the other two modalities further improves classification accuracy; moreover, converting speech into the text modality via speech recognition makes fuller use of the semantic information it contains.
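To make the fusion strategies contrasted above concrete, the following is a minimal, hypothetical sketch in PyTorch. The module names, feature dimensions, and linear classification head are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EarlyFusion(nn.Module):
        # Early feature fusion: concatenate the encoded per-modality
        # features and classify the joint representation once.
        def __init__(self, video_dim, audio_dim, text_dim, num_classes):
            super().__init__()
            self.head = nn.Linear(video_dim + audio_dim + text_dim, num_classes)

        def forward(self, v, a, t):
            return self.head(torch.cat([v, a, t], dim=-1))

    def late_decision_fusion(logits_list):
        # Late decision fusion: average the class probabilities
        # predicted independently by each modality.
        return torch.stack([F.softmax(l, dim=-1) for l in logits_list]).mean(0)

    # Hybrid strategy (corresponding to the reported best result):
    # fuse the standalone text/audio predictions with the prediction
    # from the fused features at the decision level, e.g.:
    #   fused_logits = EarlyFusion(dv, da, dt, k)(v_feat, a_feat, t_feat)
    #   final_probs  = late_decision_fusion([text_logits, audio_logits, fused_logits])

Under these assumptions, early fusion lets the classifier exploit cross-modal feature interactions, while late fusion keeps each modality's decision independent; the hybrid variant combines both signals.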