Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230300115-8. DOI: 10.11896/jsjkx.230300115

• Image Processing & Multimedia Technology •

Multi-dimensional Feature Excitation Network for Video Action Recognition

LUO Huilan, YU Yawei, WANG Chanjuan   

  1. College of Information Engineering, Jiangxi University of Technology, Ganzhou, Jiangxi 341000, China
  • Published: 2023-11-09
  • About author: LUO Huilan, born in 1974, Ph.D, professor, Ph.D supervisor. Her main research interests include computer vision and machine learning.
  • Supported by: National Natural Science Foundation of China (61862031), Leading Talents Plan for the Technical Leaders of Major Disciplines in Jiangxi Province (20213BCJ22004) and Jiangxi Province Degree and Postgraduate Education and Teaching Reform Research Key Project (JXYJG-2020-120).

Abstract: Due to the diversity of video content and the complexity of video backgrounds, effectively extracting spatio-temporal features is the main challenge in video action recognition. To learn spatio-temporal features with deep networks, researchers usually adopt two-stream networks or 3D convolution networks. Two-stream networks take optical flow as input to learn temporal features, but optical flow cannot express long-range temporal relationships, and its computation requires a lot of memory and time. 3D convolution networks, on the other hand, increase the computational cost by an order of magnitude compared with 2D convolution networks, which easily leads to over-fitting and slow convergence. To solve these problems, an attention-based multi-dimensional feature activation residual network (MFARs) is proposed for video action recognition. A motion supplement excitation module is proposed to model temporal information and excite motion cues. A united information excitation module is proposed to use temporal features to excite channel and spatial information, so as to learn better spatio-temporal features. Combining these two modules, MFARs is constructed for video action recognition. The proposed method obtains accuracies of 96.5% and 73.6% on the UCF101 and HMDB51 datasets, respectively. Compared with current mainstream action recognition models, the proposed multi-dimensional feature excitation method can effectively express spatial and temporal characteristics, and achieves a better balance between computational complexity and classification accuracy.
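To make the two excitation modules described in the abstract concrete, below is a minimal PyTorch-style sketch: adjacent-frame feature differences are squeezed into channel-attention weights (in the spirit of STM [20] and TEA [22]), and temporally pooled clip statistics drive channel and spatial excitation (in the spirit of SE [24] and CBAM [25]). The class names, the reduction ratio, the depthwise 3x3 transform and the zero-padding of the last frame are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Channel attention driven by adjacent-frame feature differences (hypothetical sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, 1)                      # channel reduction
        self.transform = nn.Conv2d(mid, mid, 3, padding=1, groups=mid)  # depthwise spatial transform
        self.expand = nn.Conv2d(mid, channels, 1)                       # channel restoration
        self.pool = nn.AdaptiveAvgPool2d(1)                             # global spatial pooling

    def forward(self, x: torch.Tensor, num_segments: int) -> torch.Tensor:
        # x: (N*T, C, H, W), T frames sampled per video as in TSN-style 2D backbones
        nt, c, h, w = x.shape
        n, t = nt // num_segments, num_segments
        feat = self.squeeze(x).view(n, t, -1, h, w)
        # difference between transformed frame t+1 and frame t approximates motion
        nxt = self.transform(feat[:, 1:].reshape(n * (t - 1), -1, h, w))
        diff = nxt.view(n, t - 1, -1, h, w) - feat[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)  # pad last frame
        attn = torch.sigmoid(self.expand(self.pool(diff.view(nt, -1, h, w))))
        return x + x * attn  # residual excitation preserves appearance features

class UnitedExcitation(nn.Module):
    """Channel + spatial attention driven by temporally pooled features (hypothetical sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.channel_fc = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True), nn.Conv2d(mid, channels, 1))
        self.spatial_conv = nn.Conv2d(1, 1, 7, padding=3)

    def forward(self, x: torch.Tensor, num_segments: int) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // num_segments
        # temporal average pooling: every frame is excited by clip-level statistics
        ctx = x.view(n, num_segments, c, h, w).mean(dim=1, keepdim=True)
        ctx = ctx.expand(-1, num_segments, -1, -1, -1).reshape(nt, c, h, w)
        ch = torch.sigmoid(self.channel_fc(ctx.mean(dim=(2, 3), keepdim=True)))  # channel weights
        sp = torch.sigmoid(self.spatial_conv(ctx.mean(dim=1, keepdim=True)))     # spatial weights
        return x * ch * sp

# usage: both modules keep the (N*T, C, H, W) interface of a 2D residual backbone
x = torch.randn(2 * 8, 64, 56, 56)           # 2 clips x 8 sampled frames
y = MotionExcitation(64)(x, num_segments=8)
z = UnitedExcitation(64)(y, num_segments=8)
print(y.shape, z.shape)                      # both torch.Size([16, 64, 56, 56])
```

Because both sketches return tensors with the input shape, they could be stacked inside a standard 2D residual block without changing the backbone's interface, which matches the abstract's claim of 2D-convolution-level computational cost.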

Key words: Action recognition, Deep learning, 2D convolution network, Attention mechanism, Video feature representation

CLC Number: TP391
[1]WANG H,KLASER A,SCHMID C,et al.Action recognition by dense trajectories[C]//Computer Vision and Pattern Recognition.IEEE,2011:3169-3176.
[2]WANG H,SCHMID C.Action Recognition with Improved Trajectories[C]//IEEE International Conference on Computer Vision.IEEE,2013:3551-3558.
[3]YILMAZ A,MUBARAK S.Actions Sketch:A Novel Action Representation[C]//Computer Vision and Pattern Recognition.IEEE,2005:984-989.
[4]BOBICK A,DAVIS J.An appearance-based representation of action[C]//International Conference on Pattern Recognition.IEEE,1996:307-312.
[5]WANG H,ULLAH M M,KLASER A,et al.Evaluation of local spatio-temporal features for action recognition[C]//Proceedings of the British Machine Vision Conference.London:British Machine Vision Association,2009:124.1-124.11.
[6]SIMONYAN K,ZISSERMAN A.Two-Stream Convolutional Networks for Action Recognition in Videos[C]//Neural Information Processing Systems.Curran Associates,Inc.,2014:568-576.
[7]WANG L,XIONG Y,WANG Z,et al.Temporal Segment Networks:Towards Good Practices for Deep Action Recognition[C]//European Conference on Computer Vision.ECCV,2016:20-36.
[8]ZHOU B,ANDONIAN A,OLIVA A,et al.Temporal Relational Reasoning in Videos[C]//European Conference on Computer Vision.ECCV,2018:831-846.
[9]DIBA A,SHARMA V,VAN GOOL L.Deep Temporal Linear Encoding Networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2017:1541-1550.
[10]JI S,XU W,YANG M,et al.3D Convolutional Neural Networks for Human Action Recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(1):221-231.
[11]TRAN D,BOURDEV L,FERGUS R,et al.Learning Spatiotemporal Features with 3D Convolutional Networks[C]//2015 IEEE International Conference on Computer Vision(ICCV).Santiago,Chile:IEEE,2015:4489-4497.
[12]CARREIRA J,ZISSERMAN A.Quo Vadis,Action Recognition? A New Model and the Kinetics Dataset[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Honolulu,HI:IEEE,2017:4724-4733.
[13]XIE S,SUN C,HUANG J,et al.Rethinking Spatiotemporal Feature Learning:Speed-Accuracy Trade-offs in Video Classification[C]//European Conference on Computer Vision.ECCV,2018:318-335.
[14]TRAN D,WANG H,TORRESANI L,et al.A Closer Look at Spatiotemporal Convolutions for Action Recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition,2018:6450-6459.
[15]HUANG M,SHANG R X,QIAN H M.Composite Deep Neural Network for Human Activities Recognition in Video[J].Pattern Recognition and Artificial Intelligence,2022,35(6):562-570.
[16]ZHANG H B,FU D M,ZHOU K.Video-Based Temporal Enhanced Action Recognition[J].Pattern Recognition and Artificial Intelligence,2020,33(10):951-958.
[17]ONG A Y,TANG C,WANG W J.Human Action Recognition Fusing Two-Stream Networks and SVM[J].Pattern Recognition and Artificial Intelligence,2021,34(9):863-870.
[18]LIN J,GAN C,HAN S.TSM:Temporal Shift Module for Efficient Video Understanding[C]//2019 IEEE/CVF International Conference on Computer Vision.Seoul,Korea(South):IEEE,2019:7082-7092.
[19]SUDHAKARAN S,ESCALERA S,LANZ O.Gate-Shift Networks for Video Action Recognition[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle,WA,USA:IEEE,2020:1099-1108.
[20]JIANG B,WANG M,GAN W,et al.STM:SpatioTemporal and Motion Encoding for Action Recognition[C]//2019 IEEE/CVF International Conference on Computer Vision.Seoul,Korea(South):IEEE,2019:2000-2009.
[21]LIU Z,LUO D,WANG Y,et al.TEINet:Towards an Efficient Architecture for Video Recognition[J].Proceedings of the AAAI Conference on Artificial Intelligence,2020,34(7):11669-11676.
[22]LI Y,JI B,SHI X,et al.TEA:Temporal Excitation and Aggregation for Action Recognition[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle,WA,USA:IEEE,2020:906-915.
[23]WANG L,TONG Z,JI B,et al.TDN:Temporal Difference Networks for Efficient Action Recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Computer Vision Foundation.IEEE,2021:1895-1904.
[24]HU J,SHEN L,ALBANIE S,et al.Squeeze-and-Excitation Networks[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2018:7132-7141.
[25]WOO S,PARK J,LEE J Y,et al.CBAM:Convolutional Block Attention Module[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:3-19.
[26]FU J,LIU J,TIAN H,et al.Dual Attention Network for Scene Segmentation[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2019:3146-3154.
[27]QIU Y,LIU Y,CHEN Y,et al.A2SPPNet:Attentive Atrous Spatial Pyramid Pooling Network for Salient Object Detection[J].IEEE Transactions on Multimedia,2022,25:1991-2006.
[28]WANG Z,SHE Q,SMOLIC A.ACTION-Net:Multipath Excitation for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2021:13209-13218.
[29]BERTASIUS G,WANG H,TORRESANI L.Is Space-Time Attention All You Need for Video Understanding?[C]//International Conference on Machine Learning.PMLR,2021:813-824.
[30]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All you Need[J].arXiv:1706.03762,2017.
[31]ARNAB A,DEHGHANI M,HEIGOLD G,et al.ViViT:A Video Vision Transformer[C]//IEEE/CVF International Conference on Computer Vision.IEEE,2021:6816-6826.
[32]SOOMRO K,ZAMIR A R,SHAH M.UCF101:A Dataset of 101 Human Actions Classes From Videos in The Wild[J].Computer Science,2012,3(12):1-9.
[33]KUEHNE H,JHUANG H,GARROTE E,et al.HMDB:A large video database for human motion recognition[C]//2011 International Conference on Computer Vision.Barcelona,Spain:IEEE,2011:2556-2563.
[34]ZHOU B,KHOSLA A,LAPEDRIZA A,et al.Learning Deep Features for Discriminative Localization[C]//IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2016:2921-2929.
[35]LUO H L,TONG K,YUAN P.Spatiotemporal squeeze-and-excitation residual multiplier network for video action recognition[J].Journal on Communications,2019,40(10):189-198.
[36]LUO H L,CHEN H.Spatial-Temporal Convolution Attention Network for Action Recognition[J].Computer Engineering and Applications,2023(9):150-158.
[37]WANG Y,LIU W,XING W.Improved Two-stream Network for Action Recognition in Complex Scenes[C]//2021 International Conference on Artificial Intelligence and Electromechanical Automation(AIEA).IEEE,2021:361-365.
[38]YANG G,ZOU W.Deep learning network model based on fusion of spatiotemporal features for action recognition[J].Multimedia Tools and Applications,2022,81(7):9875-9896.
[39]ZOLFAGHARI M,SINGH K,BROX T.ECO:Efficient Convolutional Network for Online Video Understanding[C]//European Conference on Computer Vision(ECCV).2018:695-712.
[40]MING Y,FENG F,LI C,et al.3D-TDC:A 3D temporal dilation convolution framework for video action recognition[J].Neurocomputing,2021,450:362-371.
[41]ZHANG K,YANG J,ZHANG D,et al.MRTP:Multi-Temporal Resolution Real-Time Action Recognition Approach by Time-Action Perception[J].Journal of Xi'an Jiaotong University,2022,56(3):22-32.
[42]HE D,ZHOU Z,GAN C,et al.StNet:Local and Global Spatial-Temporal Modeling for Action Recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:8401-8408.
[43]ZHANG Z,PENG Y,GAN C,et al.Separable 3D residual attention network for human action recognition[J].Multimedia Tools and Applications,2022,82(4):5435-5453.
[44]CHEN B,TANG H,ZHANG Z,et al.Video-based action recognition using spurious-3D residual attention networks[J].IET Image Processing,2022,16(11):3097-3111.