Computer Science ›› 2024, Vol. 51 ›› Issue (7): 229-235.doi: 10.11896/jsjkx.230500054

• Computer Graphics & Multimedia •

Action Recognition Model Based on Improved Two Stream Vision Transformer

LEI Yongsheng1, DING Meng1,2, SHEN Yao1, LI Juhao1, ZHAO Dongyue1, CHEN Fushi1   

  1. Department of Criminal Investigation, People's Public Security University of China, Beijing 100038, China
    2. Public Security Behavioral Science Lab, People's Public Security University of China, Beijing 100038, China
  • Received: 2023-05-09  Revised: 2023-10-09  Online: 2024-07-15  Published: 2024-07-10
  • About author: LEI Yongsheng, born in 1999, postgraduate. His main research interests include digital forensics.
    DING Meng, born in 1980, master, associate professor, postgraduate supervisor. His main research interests include digital forensics and video processing.
  • Supported by:
    First-class Discipline Training Program for Public Security Studies and Construction Project for Laboratory of Public Safety Behavior Science (2023ZB02).

Abstract: To address the poor resistance to background interference and the low accuracy of existing action recognition methods, an improved two-stream Vision Transformer action recognition model is proposed. The model adopts a segmented sampling strategy to strengthen its ability to process long temporal sequences. A parameter-free attention module embedded at the head of the network enhances the model's feature representation while suppressing background interference, and a temporal attention module embedded at the tail of the network fully exploits temporal features by integrating high-level semantic information in the time domain. A new joint loss function is proposed to enlarge inter-class differences and reduce intra-class differences, and a decision fusion layer is adopted to make full use of both the optical-flow and the RGB features. Comparative and ablation experiments are conducted on the benchmark datasets UCF101 and HMDB51. The ablation results verify the effectiveness of each proposed component, and the comparison results show that the accuracy of the proposed method is 3.48% and 7.76% higher than that of the temporal segment network on the two datasets respectively, outperforming current mainstream algorithms and demonstrating good recognition performance.
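The full paper is not reproduced on this page, so the sketches below are illustrative rather than the authors' code. The first is the SimAM parameter-free attention module (Yang et al., ICML 2021) that the abstract embeds at the head of the network; the energy-based weighting follows the published formulation, while the default lambda and the folding of video frames into the batch dimension are assumptions.

import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention (Yang et al., ICML 2021).

    Each activation is weighted by a sigmoid of its inverse energy, so
    neurons that deviate from their channel mean are emphasized, which
    helps suppress uniform action backgrounds."""

    def __init__(self, e_lambda: float = 1e-4):  # 1e-4 is the published default
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); video frames can be folded into the batch axis
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation from the channel mean
        v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance estimate
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5         # inverse of the SimAM energy
        return x * torch.sigmoid(e_inv)

The abstract does not spell out the joint loss or the decision fusion layer, so the following is only one common way to realize "enlarge inter-class, reduce intra-class" (cross-entropy plus center loss) together with two-stream score fusion; the loss weight alpha and the 1.5 optical-flow weighting are conventional choices, not values taken from the paper.

import torch
import torch.nn.functional as F

def fused_scores_and_joint_loss(rgb_logits, flow_logits, feats, labels, centers,
                                alpha=0.1, w_flow=1.5):
    # decision fusion: weighted sum of the two streams' softmax scores
    scores = F.softmax(rgb_logits, dim=1) + w_flow * F.softmax(flow_logits, dim=1)
    # cross-entropy drives inter-class separation on both streams
    ce = F.cross_entropy(rgb_logits, labels) + F.cross_entropy(flow_logits, labels)
    # center loss pulls features toward their class center (intra-class compactness);
    # centers is a (num_classes, feat_dim) tensor of learnable class centers
    center = (feats - centers[labels]).pow(2).sum(dim=1).mean()
    return scores, ce + alpha * center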

Key words: Action recognition, Vision Transformer, SimAM parameter-free attention, Temporal attention, Joint loss

CLC Number: TP391.7