Computer Science ›› 2024, Vol. 51 ›› Issue (7): 229-235. doi: 10.11896/jsjkx.230500054
LEI Yongsheng1, DING Meng1,2, SHEN Yao1, LI Juhao1, ZHAO Dongyue1, CHEN Fushi1
Abstract: To address the poor robustness to background interference and the low accuracy of existing action recognition methods, an improved two-stream vision Transformer action recognition model is proposed. The model adopts segment-based sampling to strengthen its ability to handle long temporal sequences. A parameter-free attention module is embedded at the head of the network, which suppresses background interference around the action while enhancing the model's feature representation; a temporal attention module is embedded at the tail of the network to fully extract temporal features by fusing high-level semantic information in the time domain. A new joint loss function is proposed to enlarge inter-class differences while reducing intra-class differences, and a decision-fusion layer is adopted to make full use of both the optical-flow and RGB-stream features. Ablation and comparison experiments on the benchmark datasets UCF101 and HMDB51 verify the effectiveness of the proposed method: compared with the temporal segment network, it improves accuracy by 3.48% and 7.76% on the two datasets respectively, outperforming current mainstream algorithms and achieving strong recognition performance.
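Two of the steps summarized above — sparse segment-based sampling of a long frame sequence and decision-level fusion of the RGB and optical-flow streams — can be illustrated with a minimal NumPy sketch. This is a generic illustration under stated assumptions, not the paper's implementation: the middle-of-segment sampling rule, the fusion weight `w_rgb`, and the example logits are all hypothetical.

```python
import numpy as np

def segment_sample(num_frames, num_segments):
    """Split the clip into equal-length segments and take one frame
    from the middle of each segment (sparse long-range sampling)."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decision_fuse(rgb_logits, flow_logits, w_rgb=0.5):
    """Late (decision-level) fusion: weighted average of the two
    streams' class-probability vectors, then argmax per clip."""
    probs = w_rgb * softmax(rgb_logits) + (1.0 - w_rgb) * softmax(flow_logits)
    return probs.argmax(axis=-1)

# Hypothetical example: 100-frame clip sampled into 4 segments,
# and per-stream logits for one clip over 3 action classes.
indices = segment_sample(100, 4)
rgb = np.array([[2.0, 0.5, 0.1]])
flow = np.array([[0.3, 1.8, 0.2]])
fused_class = decision_fuse(rgb, flow, w_rgb=0.6)
```

With a higher `w_rgb` the fused decision follows the RGB stream's confident class; with `w_rgb=0` it reduces to the optical-flow prediction alone, which is the usual trade-off a decision-fusion layer tunes.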