Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230300115-8. doi: 10.11896/jsjkx.230300115
罗会兰, 于亚威, 王婵娟
LUO Huilan, YU Yawei, WANG Chanjuan
Abstract: In action recognition, the diversity of content and complexity of backgrounds in video data make extracting effective spatio-temporal features the central difficulty. To learn spatio-temporal features with deep networks, researchers commonly adopt two-stream networks and 3D convolutional networks. However, the optical-flow stream of a two-stream network cannot capture long-range temporal relations, and optical-flow extraction consumes substantial memory and time, while 3D convolution increases computational cost by an order of magnitude over 2D convolution and is prone to overfitting and slow convergence. To address these problems, an attention-based multi-dimensional feature excitation and fusion network, MFARs (Multi-dimensional Feature Activation Residual networks), is proposed for video action recognition. MFARs uses a 2D convolutional backbone to learn temporal feature representations: a motion-complement excitation module models temporal features and excites motion information along the temporal channels, while a joint feature excitation module uses temporal features to excite channel and spatial information, yielding better spatio-temporal representations. MFARs achieves accuracies of 96.5% and 73.6% on the action recognition datasets UCF101 and HMDB51, respectively. Compared with current mainstream action recognition models, the proposed multi-dimensional feature excitation method represents spatio-temporal features effectively and strikes a better balance between complexity and classification accuracy.
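The abstract does not give the exact design of the excitation modules, so the following is only a minimal PyTorch-style sketch, under stated assumptions, of a motion-excitation block in the spirit described: temporal differences between adjacent frames are pooled into channel attention that gates the 2D features, so motion cues are injected without optical flow or 3D convolution. The class name MotionExcitation and the parameters n_segments and reduction are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a motion-excitation block: frame differences in a
# reduced channel space drive a sigmoid gate over the 2D features.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels: int, n_segments: int, reduction: int = 16):
        super().__init__()
        self.n_segments = n_segments          # frames sampled per clip (assumption)
        red = channels // reduction           # reduced channel width
        self.squeeze = nn.Conv2d(channels, red, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(red, red, kernel_size=3, padding=1,
                                   groups=red, bias=False)  # depthwise transform
        self.expand = nn.Conv2d(red, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W) -- frames of each clip stacked along the batch axis
        nt, c, h, w = x.shape
        n = nt // self.n_segments
        r = self.squeeze(x)                              # (N*T, C/r, H, W)
        r = r.view(n, self.n_segments, -1, h, w)         # (N, T, C/r, H, W)
        # difference between transformed frame t+1 and frame t approximates motion
        r_next = self.transform(r[:, 1:].reshape(-1, r.size(2), h, w))
        diff = r_next.view(n, self.n_segments - 1, -1, h, w) - r[:, :-1]
        # zero-pad the last step so the temporal length is preserved
        diff = torch.cat([diff, diff.new_zeros(n, 1, r.size(2), h, w)], dim=1)
        diff = diff.reshape(nt, -1, h, w)
        att = self.sigmoid(self.expand(self.pool(diff))) # (N*T, C, 1, 1)
        return x + x * att                               # residual excitation
```

Because the output shape matches the input, such a unit can be dropped into a standard 2D ResNet stage at near-constant cost; for example, with 8 segments a tensor of shape (8, 256, 56, 56) passes through with its shape unchanged.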