Computer Science ›› 2024, Vol. 51 ›› Issue (7): 229-235.doi: 10.11896/jsjkx.230500054

• Computer Graphics & Multimedia •

Action Recognition Model Based on Improved Two Stream Vision Transformer

LEI Yongsheng1, DING Meng1,2, SHEN Yao1, LI Juhao1, ZHAO Dongyue1, CHEN Fushi1   

  1. Department of Criminal Investigation, People's Public Security University of China, Beijing 100038, China
    2. Public Security Behavioral Science Lab, People's Public Security University of China, Beijing 100038, China
  • Received: 2023-05-09  Revised: 2023-10-09  Online: 2024-07-15  Published: 2024-07-10
  • About author: LEI Yongsheng, born in 1999, postgraduate. His main research interests include digital forensics.
    DING Meng, born in 1980, master, associate professor, postgraduate supervisor. His main research interests include digital forensics and video processing.
  • Supported by:
    First-class Discipline Training Program for Public Security Studies and Construction Project for Laboratory of Public Safety Behavior Science (2023ZB02).

Abstract: To address the poor resistance to background interference and the low accuracy of existing action recognition methods, an improved two-stream Vision Transformer action recognition model is proposed. The model adopts a segmented sampling strategy to strengthen its ability to process long temporal sequences. A parameter-free attention module embedded at the head of the network enhances the model's feature representation while suppressing background interference, and a temporal attention module embedded at the tail of the network fully exploits temporal features by integrating high-level semantic information in the time domain. A new joint loss function is proposed to enlarge inter-class differences and reduce intra-class differences, and a decision fusion layer is adopted to make full use of both the optical-flow and the RGB features. Comparative and ablation experiments are conducted on the benchmark datasets UCF101 and HMDB51. The ablation results verify the effectiveness of each proposed component, and the comparison results show that the accuracy of the proposed method is 3.48% and 7.76% higher than that of the temporal segment network on the two datasets respectively, outperforming current mainstream algorithms and demonstrating good recognition performance.
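The full paper is not reproduced on this page, so the sketches below are illustrative rather than the authors' code. The first is the SimAM parameter-free attention module (Yang et al., ICML 2021) that the abstract embeds at the head of the network; the energy-based weighting follows the published formulation, while the default lambda and the folding of video frames into the batch dimension are assumptions.

import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention (Yang et al., ICML 2021).

    Each activation is weighted by a sigmoid of its inverse energy, so
    neurons that deviate from their channel mean are emphasized, which
    helps suppress uniform action backgrounds."""

    def __init__(self, e_lambda: float = 1e-4):  # 1e-4 is the published default
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); video frames can be folded into the batch axis
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation from the channel mean
        v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance estimate
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5         # inverse of the SimAM energy
        return x * torch.sigmoid(e_inv)

The abstract does not spell out the joint loss or the decision fusion layer, so the following is only one common way to realize "enlarge inter-class, reduce intra-class" (cross-entropy plus center loss) together with two-stream score fusion; the loss weight alpha and the 1.5 optical-flow weighting are conventional choices, not values taken from the paper.

import torch
import torch.nn.functional as F

def fused_scores_and_joint_loss(rgb_logits, flow_logits, feats, labels, centers,
                                alpha=0.1, w_flow=1.5):
    # decision fusion: weighted sum of the two streams' softmax scores
    scores = F.softmax(rgb_logits, dim=1) + w_flow * F.softmax(flow_logits, dim=1)
    # cross-entropy drives inter-class separation on both streams
    ce = F.cross_entropy(rgb_logits, labels) + F.cross_entropy(flow_logits, labels)
    # center loss pulls features toward their class center (intra-class compactness);
    # centers is a (num_classes, feat_dim) tensor of learnable class centers
    center = (feats - centers[labels]).pow(2).sum(dim=1).mean()
    return scores, ce + alpha * center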

Key words: Action recognition, Vision Transformer, SimAM parameter-free attention, Temporal attention, Joint loss

CLC Number: TP391.7