Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240900046-8. doi: 10.11896/jsjkx.240900046

• Artificial Intelligence •

FB-TimesNet: An Improved Multimodal Emotion Recognition Method Based on TimesNet

LI Weirong, YIN Jibin   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650031, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: YIN Jibin (yjblovelh@aliyun.com)
  • About author: LI Weirong, born in 2000, postgraduate (weirongli1024@aliyun.com), is a member of CCF (No.V2902G). His main research interests include human-computer interaction, emotion recognition and deep learning.
    YIN Jibin, born in 1976, Ph.D, associate professor. His main research interests include human-computer interaction and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (61741206).

Abstract: To address limitations of the emotion recognition field such as single-modality information sources, poor interference resistance, high computational cost, and little attention to temporal features, this paper proposes FB-TimesNet, a hybrid facial-expression and body-posture emotion recognition method based on an improved TimesNet. First, human keypoint coordinates are extracted from the video frames; the displacements of the facial keypoints relative to the neutral state and the body-posture keypoint coordinates serve as the raw features of facial expression and body posture respectively, which reduces data dimensionality and computational cost. Second, the fast Fourier transform captures the periodic variation of the input, the one-dimensional sequences are folded into two-dimensional maps, and 2D convolution kernels then encode and extract spatio-temporal features from the two feature sets separately to strengthen the representational power of the data. Finally, a fusion algorithm dynamically allocates the weight of each modality to obtain the best fusion effect. Extensive comparative experiments on two common emotion datasets show that FB-TimesNet improves classification accuracy over the baseline model by 4.89% on the BRED dataset.
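The period-based folding of 1D sequences into 2D maps described above is the core TimesNet mechanism that FB-TimesNet builds on. Below is a minimal PyTorch sketch of that mechanism, assuming [batch, time, channels] keypoint sequences; the function names fft_for_period and fold_to_2d are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def fft_for_period(x, k=2):
    """Pick the k dominant periods of a batch of sequences via the FFT.

    x: [batch, time, channels] keypoint feature sequences.
    Returns the top-k period lengths and their per-sample amplitudes.
    """
    xf = torch.fft.rfft(x, dim=1)
    amp = xf.abs().mean(dim=0).mean(dim=-1)   # mean amplitude per frequency
    amp[0] = 0                                # ignore the DC component
    _, top_freqs = torch.topk(amp, k)
    periods = x.shape[1] // top_freqs         # frequency index -> period length
    return periods, xf.abs().mean(dim=-1)[:, top_freqs]

def fold_to_2d(x, period):
    """Reshape a 1D sequence into a 2D map: each row holds one period,
    so a 2D kernel sees intra-period variation along rows and
    inter-period variation down columns."""
    b, t, c = x.shape
    pad = (period - t % period) % period      # right-pad so t divides evenly
    x = F.pad(x, (0, 0, 0, pad))
    # [b, t+pad, c] -> [b, c, (t+pad)//period, period]
    return x.reshape(b, (t + pad) // period, period, c).permute(0, 3, 1, 2)

# Example: fold a batch of 96-frame sequences on the dominant period
x = torch.randn(8, 96, 34)                    # e.g. 17 keypoints x (dx, dy)
periods, amps = fft_for_period(x, k=2)
maps = fold_to_2d(x, int(periods[0]))         # [8, 34, n_periods, p], ready for Conv2d
```

In full TimesNet, each folded map passes through an Inception-style 2D convolution block, is flattened back to 1D, and the top-k period branches are aggregated with softmax weights derived from their spectral amplitudes; per the abstract, FB-TimesNet runs this pipeline on the facial and postural keypoint streams separately.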

Key words: Video emotion recognition, Spatio-temporal features, Expression recognition, Body posture, Multimodal feature fusion
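The abstract leaves the dynamic weight-allocation step unspecified; one plausible realization, offered here only as an assumption, is a learned softmax gate over the pooled per-modality embeddings (the class name GatedFusion is hypothetical):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical dynamic fusion: learns a data-dependent scalar
    weight for each modality and returns their weighted sum."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, face_feat: torch.Tensor, body_feat: torch.Tensor) -> torch.Tensor:
        # face_feat, body_feat: [batch, dim] pooled embeddings of the two branches
        feats = torch.stack([face_feat, body_feat], dim=1)  # [batch, 2, dim]
        weights = torch.softmax(self.score(feats), dim=1)   # [batch, 2, 1]
        return (weights * feats).sum(dim=1)                 # [batch, dim]
```

Because the weights are computed per sample, such a gate can lean on the body-posture branch when the face is occluded or noisy, and vice versa, which matches the robustness motivation stated in the abstract.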

CLC Number: TP242