Computer Science, 2024, Vol. 51, Issue 7: 229-235. doi: 10.11896/jsjkx.230500054

• Computer Graphics & Multimedia •

Action Recognition Model Based on Improved Two-Stream Vision Transformer

LEI Yongsheng1, DING Meng1,2, SHEN Yao1, LI Juhao1, ZHAO Dongyue1, CHEN Fushi1   

  1 Department of Criminal Investigation, People's Public Security University of China, Beijing 100038, China
    2 Public Security Behavioral Science Lab, People's Public Security University of China, Beijing 100038, China
  • Received: 2023-05-09  Revised: 2023-10-09  Online: 2024-07-15  Published: 2024-07-10
  • Corresponding author: DING Meng (dingmeng@ppsuc.edu.cn)
  • About author: LEI Yongsheng (834624067@qq.com), born in 1999, postgraduate. His main research interests include digital forensics.
    DING Meng, born in 1980, master, associate professor, postgraduate supervisor. His main research interests include digital forensics and video processing.
  • Supported by:
    First-class Discipline Training Program for Public Security Studies and Construction Project for the Laboratory of Public Safety Behavior Science (2023ZB02).

Abstract: To address the poor resistance to background interference and the low accuracy of existing action recognition methods, an improved two-stream Vision Transformer action recognition model is proposed. The model adopts a segment-based sampling strategy to strengthen its handling of long temporal sequences; a parameter-free attention module embedded at the head of the network improves the model's feature representation while suppressing interference from the action background; and a temporal attention module embedded at the tail of the network fully extracts temporal features by fusing high-level semantic information in the time domain. A new joint loss function is proposed to enlarge inter-class differences and reduce intra-class differences, and a decision fusion layer is adopted to make full use of the optical-flow and RGB-stream features. Ablation and comparative experiments on the benchmark datasets UCF101 and HMDB51 verify the effectiveness of the proposed improvements; compared with the temporal segment network (TSN), the proposed method improves accuracy by 3.48% and 7.76% on the two datasets, respectively, outperforming current mainstream algorithms.
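
The parameter-free attention referred to in the abstract is SimAM (Yang et al., ICML 2021), whose published formulation adds no learnable weights. The following is a minimal PyTorch sketch of that public formulation for a 2D feature map; how the model reshapes ViT tokens into a spatial grid before applying it is not stated on this page, so that layout is an assumption.

```python
import torch
import torch.nn as nn


class SimAM(nn.Module):
    """Parameter-free attention in the spirit of SimAM (Yang et al., 2021).

    Each activation is re-weighted by a saliency score derived from an energy
    function over its channel's spatial positions; no learnable parameters are
    introduced, so the module can be dropped into a backbone anywhere.
    """

    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps  # the lambda regulariser in the SimAM energy function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        n = x.shape[2] * x.shape[3] - 1
        # squared deviation of every position from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel-wise variance estimate
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse of the minimal energy: lower energy marks a more salient neuron
        e_inv = d / (4 * (v + self.eps)) + 0.5
        # sigmoid keeps the re-weighting factors in (0, 1)
        return x * torch.sigmoid(e_inv)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)   # dummy feature map
    print(SimAM()(feat).shape)          # torch.Size([2, 64, 56, 56])
```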

Key words: Action recognition, Vision Transformer, SimAM parameter-free attention, Temporal attention, Joint loss
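
The abstract states only the goals of the joint loss (larger inter-class and smaller intra-class differences) and of the decision fusion layer (combining the RGB and optical-flow streams); their exact formulations are not given on this page. The sketch below shows one common way to realize both ideas: cross-entropy plus a center-loss-style compactness term, and a weighted average of the two streams' class probabilities. The names JointLoss and fuse_decisions, the weight lam = 0.1, and the equal fusion weights are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointLoss(nn.Module):
    """Illustrative joint loss: cross-entropy drives inter-class separation, while a
    center-loss-style term pulls features toward their class centre to shrink
    intra-class variance. `lam` balances the two terms (assumed value)."""

    def __init__(self, num_classes: int, feat_dim: int, lam: float = 0.1):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lam = lam

    def forward(self, feats: torch.Tensor, logits: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, labels)                            # inter-class term
        compact = (feats - self.centers[labels]).pow(2).sum(1).mean()   # intra-class term
        return ce + self.lam * compact


def fuse_decisions(rgb_logits: torch.Tensor, flow_logits: torch.Tensor,
                   w_rgb: float = 0.5) -> torch.Tensor:
    """Late (decision-level) fusion of the two streams: a weighted average of their
    class-probability vectors. Equal weights are an assumption, not a reported setting."""
    return w_rgb * F.softmax(rgb_logits, dim=1) + (1.0 - w_rgb) * F.softmax(flow_logits, dim=1)


if __name__ == "__main__":
    rgb, flow = torch.randn(4, 101), torch.randn(4, 101)   # dummy logits for 101 classes
    print(fuse_decisions(rgb, flow).argmax(dim=1))          # fused class predictions
```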

CLC number: TP391.7