Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230300115-8. doi: 10.11896/jsjkx.230300115
罗会兰, 于亚威, 王婵娟
LUO Huilan, YU Yawei, WANG Chanjuan
Abstract: In action recognition, the diversity of content and complexity of backgrounds in video data make extracting effective spatio-temporal features the central difficulty. To learn spatio-temporal features with deep networks, researchers commonly adopt two-stream networks and 3D convolutional networks. However, the optical-flow stream of a two-stream network cannot capture long-range temporal relations, and optical-flow extraction consumes substantial memory and time, while 3D convolution increases computational cost by an order of magnitude over 2D convolution and is prone to overfitting and slow convergence. To address these problems, an attention-based multi-dimensional feature excitation and fusion network, MFARs (Multi-dimensional Feature Activation Residual networks), is proposed for video action recognition. MFARs uses a 2D convolutional backbone to learn temporal feature representations: a motion-complement excitation module models temporal features and excites motion information along the temporal channels, while a joint feature excitation module uses temporal features to excite channel and spatial information, yielding better spatio-temporal representations. MFARs achieves accuracies of 96.5% and 73.6% on the action recognition datasets UCF101 and HMDB51, respectively. Compared with current mainstream action recognition models, the proposed multi-dimensional feature excitation method represents spatio-temporal features effectively and strikes a better balance between complexity and classification accuracy.
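The abstract does not give the exact design of the excitation modules, so the following is only a minimal PyTorch-style sketch, under stated assumptions, of a motion-excitation block in the spirit described: temporal differences between adjacent frames are pooled into channel attention that gates the 2D features, so motion cues are injected without optical flow or 3D convolution. The class name MotionExcitation and the parameters n_segments and reduction are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a motion-excitation block: frame differences in a
# reduced channel space drive a sigmoid gate over the 2D features.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels: int, n_segments: int, reduction: int = 16):
        super().__init__()
        self.n_segments = n_segments          # frames sampled per clip (assumption)
        red = channels // reduction           # reduced channel width
        self.squeeze = nn.Conv2d(channels, red, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(red, red, kernel_size=3, padding=1,
                                   groups=red, bias=False)  # depthwise transform
        self.expand = nn.Conv2d(red, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W) -- frames of each clip stacked along the batch axis
        nt, c, h, w = x.shape
        n = nt // self.n_segments
        r = self.squeeze(x)                              # (N*T, C/r, H, W)
        r = r.view(n, self.n_segments, -1, h, w)         # (N, T, C/r, H, W)
        # difference between transformed frame t+1 and frame t approximates motion
        r_next = self.transform(r[:, 1:].reshape(-1, r.size(2), h, w))
        diff = r_next.view(n, self.n_segments - 1, -1, h, w) - r[:, :-1]
        # zero-pad the last step so the temporal length is preserved
        diff = torch.cat([diff, diff.new_zeros(n, 1, r.size(2), h, w)], dim=1)
        diff = diff.reshape(nt, -1, h, w)
        att = self.sigmoid(self.expand(self.pool(diff))) # (N*T, C, 1, 1)
        return x + x * att                               # residual excitation
```

Because the output shape matches the input, such a unit can be dropped into a standard 2D ResNet stage at near-constant cost; for example, with 8 segments a tensor of shape (8, 256, 56, 56) passes through with its shape unchanged.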