Computer Science, 2024, Vol. 51, Issue 7: 229-235. doi: 10.11896/jsjkx.230500054

• Computer Graphics & Multimedia •

Action Recognition Model Based on Improved Two-Stream Vision Transformer

LEI Yongsheng1, DING Meng1,2, SHEN Yao1, LI Juhao1, ZHAO Dongyue1, CHEN Fushi1   

  1 Department of Criminal Investigation, People's Public Security University of China, Beijing 100038, China
    2 Public Security Behavioral Science Lab, People's Public Security University of China, Beijing 100038, China
  • Received: 2023-05-09  Revised: 2023-10-09  Online: 2024-07-15  Published: 2024-07-10
  • Corresponding author: DING Meng (dingmeng@ppsuc.edu.cn)
  • About author: LEI Yongsheng (834624067@qq.com), born in 1999, postgraduate. His main research interests include digital forensics.
    DING Meng, born in 1980, master, associate professor, postgraduate supervisor. His main research interests include digital forensics and video processing.
  • Supported by:
    First-class Discipline Training Program for Public Security Studies and Construction Project for the Laboratory of Public Safety Behavior Science (2023ZB02).

Abstract: To address the poor resistance to background interference and the low accuracy of existing action recognition methods, an improved two-stream Vision Transformer action recognition model is proposed. The model adopts a segment-based sampling strategy to strengthen its handling of long temporal sequences; a parameter-free attention module embedded at the head of the network improves the model's feature representation while suppressing interference from the action background; and a temporal attention module embedded at the tail of the network fully extracts temporal features by fusing high-level semantic information in the time domain. A new joint loss function is proposed to enlarge inter-class differences and reduce intra-class differences, and a decision fusion layer is adopted to make full use of the optical-flow and RGB-stream features. Ablation and comparative experiments on the benchmark datasets UCF101 and HMDB51 verify the effectiveness of the proposed improvements; compared with the temporal segment network (TSN), the proposed method improves accuracy by 3.48% and 7.76% on the two datasets, respectively, outperforming current mainstream algorithms.
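
The parameter-free attention referred to in the abstract is SimAM (Yang et al., ICML 2021), whose published formulation adds no learnable weights. The following is a minimal PyTorch sketch of that public formulation for a 2D feature map; how the model reshapes ViT tokens into a spatial grid before applying it is not stated on this page, so that layout is an assumption.

```python
import torch
import torch.nn as nn


class SimAM(nn.Module):
    """Parameter-free attention in the spirit of SimAM (Yang et al., 2021).

    Each activation is re-weighted by a saliency score derived from an energy
    function over its channel's spatial positions; no learnable parameters are
    introduced, so the module can be dropped into a backbone anywhere.
    """

    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps  # the lambda regulariser in the SimAM energy function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        n = x.shape[2] * x.shape[3] - 1
        # squared deviation of every position from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel-wise variance estimate
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse of the minimal energy: lower energy marks a more salient neuron
        e_inv = d / (4 * (v + self.eps)) + 0.5
        # sigmoid keeps the re-weighting factors in (0, 1)
        return x * torch.sigmoid(e_inv)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)   # dummy feature map
    print(SimAM()(feat).shape)          # torch.Size([2, 64, 56, 56])
```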

Key words: Action recognition, Vision Transformer, SimAM parameter-free attention, Temporal attention, Joint loss
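
The abstract states only the goals of the joint loss (larger inter-class and smaller intra-class differences) and of the decision fusion layer (combining the RGB and optical-flow streams); their exact formulations are not given on this page. The sketch below shows one common way to realize both ideas: cross-entropy plus a center-loss-style compactness term, and a weighted average of the two streams' class probabilities. The names JointLoss and fuse_decisions, the weight lam = 0.1, and the equal fusion weights are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointLoss(nn.Module):
    """Illustrative joint loss: cross-entropy drives inter-class separation, while a
    center-loss-style term pulls features toward their class centre to shrink
    intra-class variance. `lam` balances the two terms (assumed value)."""

    def __init__(self, num_classes: int, feat_dim: int, lam: float = 0.1):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lam = lam

    def forward(self, feats: torch.Tensor, logits: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, labels)                            # inter-class term
        compact = (feats - self.centers[labels]).pow(2).sum(1).mean()   # intra-class term
        return ce + self.lam * compact


def fuse_decisions(rgb_logits: torch.Tensor, flow_logits: torch.Tensor,
                   w_rgb: float = 0.5) -> torch.Tensor:
    """Late (decision-level) fusion of the two streams: a weighted average of their
    class-probability vectors. Equal weights are an assumption, not a reported setting."""
    return w_rgb * F.softmax(rgb_logits, dim=1) + (1.0 - w_rgb) * F.softmax(flow_logits, dim=1)


if __name__ == "__main__":
    rgb, flow = torch.randn(4, 101), torch.randn(4, 101)   # dummy logits for 101 classes
    print(fuse_decisions(rgb, flow).argmax(dim=1))          # fused class predictions
```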

CLC number: TP391.7