Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 211200025-10.doi: 10.11896/jsjkx.211200025

• Image Processing & Multimedia Technology •

Survey of Deep Learning Networks for Video Recognition

QIAN Wen-xiang1,3, YI Yang1,2,3   

  1 School of Computer Science and Engineering,Sun Yat-sen University,Guangzhou 510275,China
    2 School of Information Science,Guangzhou Xinhua University,Guangzhou 510520,China
    3 Guangdong Key Laboratory of Big Data Analysis and Processing,Guangzhou 510275,China
  • Online:2022-11-10 Published:2022-11-21
  • About author:QIAN Wen-xiang,born in 1992,postgraduate.His main research interests include human body recognition in natural scenes and so on.
    YI Yang,born in 1967,Ph.D,associate professor.Her main research interests include human body recognition in natural scenes and so on.
  • Supported by:
    Guangzhou Science and Technology Project(202002030273,202102080656) and Key Discipline Project of Guangzhou Xinhua University(2020XZD02).

Abstract: Video recognition is one of the most important tasks in computer vision and has attracted the attention of many researchers. It refers to extracting key features from video clips, analyzing those features, and classifying the video. Compared with a single static picture, there are significant differences between the frames of a video clip, and how to capture these differences along the spatio-temporal dimension is a central concern of researchers. Taking video recognition technology as the research target, this paper first introduces the basic concepts of video recognition and the challenges in this area, together with the most frequently used datasets for video recognition tasks. Then the classic video recognition methods based on spatio-temporal interest points, dense trajectories, and improved dense trajectories are reviewed. The deep learning network frameworks for video recognition proposed in recent years are then summarized, ordered by the time of their proposal and grouped by network architecture. Among them, frameworks based on 2D convolutional neural networks are introduced, including the two-stream convolutional network architecture, long short-term memory networks, and long-term recurrent convolutional networks. Next, frameworks based on 3D convolutional neural networks are introduced, including the SlowFast network and the X3D (eXpand 3D) network. Following that, pseudo-3D convolutional neural networks are introduced, including the R(2+1)D network, the Pseudo-3D residual network, and a set of lightweight networks that build models on temporal information. Finally, Transformer-based networks are introduced, including TimeSformer, the Video Vision Transformer (ViViT), and the shifted-window Transformer (Swin Transformer). The evolution of these deep learning frameworks and their implementation details and characteristics are analyzed, the performance of each network on different datasets is evaluated, and the applicable scenarios of each network are discussed. In the end, future research trends for video recognition network frameworks are prospected. Video recognition can automatically and efficiently recognize the category to which a video belongs, and video recognition based on deep learning has a wide range of practical value.
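The two-stream architecture mentioned in the abstract classifies a video by fusing the predictions of an appearance (RGB) stream and a motion (optical-flow) stream, commonly by averaging their class probabilities. The following is an illustrative sketch of that late-fusion step only; the logit values are made up and no actual network is involved:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical per-class scores: the spatial stream sees appearance (RGB
# frames), the temporal stream sees motion (stacked optical flow).
spatial_logits  = [2.0, 0.5, 1.0]   # e.g. classes: walk, jump, wave
temporal_logits = [0.8, 2.5, 0.2]

# Late fusion: average the per-stream class probability distributions.
p_spatial  = softmax(spatial_logits)
p_temporal = softmax(temporal_logits)
fused = [(a + b) / 2 for a, b in zip(p_spatial, p_temporal)]

# The motion stream dominates here, so class 1 wins after fusion even
# though the spatial stream alone would have picked class 0.
predicted = max(range(len(fused)), key=fused.__getitem__)
```

This illustrates why the two streams complement each other: appearance alone can be misleading for motion-defined actions, and fusion lets the stronger cue win.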

Key words: Video recognition, Improved dense trajectory, Deep learning, Two-stream network, Convolutional neural network, Deep self-attention network
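As a concrete illustration of the pseudo-3D idea surveyed above: the R(2+1)D approach factorizes a t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal one, choosing the intermediate channel count M so that the factorized block roughly matches the parameter budget of the full 3D kernel. A minimal parameter-count sketch (the layer sizes below are made-up examples, not values from any cited network):

```python
# Parameter counts for a full 3D convolution vs. its (2+1)D factorization.
# n_in/n_out are input/output channels; t, d are temporal/spatial kernel sizes.

def params_3d(n_in, n_out, t, d):
    # One t x d x d kernel per (input, output) channel pair (biases omitted).
    return t * d * d * n_in * n_out

def params_2plus1d(n_in, n_out, t, d, m):
    # 1 x d x d spatial conv into m channels, then t x 1 x 1 temporal conv.
    return d * d * n_in * m + t * m * n_out

def matching_m(n_in, n_out, t, d):
    # M chosen so the factorized block matches the 3D parameter budget
    # (floor of the ratio, following the R(2+1)D formulation).
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

# Made-up example layer: 64 -> 64 channels, 3x3x3 kernel.
n_in = n_out = 64
t = d = 3
m = matching_m(n_in, n_out, t, d)          # intermediate channels
full = params_3d(n_in, n_out, t, d)        # full 3D kernel parameters
fact = params_2plus1d(n_in, n_out, t, d, m)
```

With the parameter count held fixed, the factorized block gains an extra nonlinearity between the spatial and temporal convolutions, which the survey identifies as one reason (2+1)D blocks are easier to optimize than full 3D convolutions.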

CLC Number: TP183
[1]DALAL N,TRIGGS B.Histograms of oriented gradients for human detection[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Diego,USA,2005:886-893.
[2]CHAUDHRY R,RAVICHANDRAN A,HAGER G,et al.Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions[C]//IEEE Conference on Computer Vision and Pattern Recognition.Miami,USA,2009:1932-1939.
[3]WANG H,KLASER A,SCHMID C,et al.Dense Trajectories and Motion Boundary Descriptors for Action Recognition[J].International Journal of Computer Vision,2013,103(1):61-79.
[4]LAZEBNIK S,SCHMID C,PONCE J.Beyond Bags of Features:Spatial Pyramid Matching for Recognizing Natural Scene Categories[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition.New York,USA,2006:2169-2178.
[5]YANG M,ZHANG L,FENG X,et al.Sparse representation based Fisher discrimination dictionary learning for image classification[J].International Journal of Computer Vision,2014,109(3):209-232.
[6]HINTON G E.Learning multiple layers of representation[J].Trends in Cognitive Sciences,2007,11(10):428-434.
[7]DENG L,YU D.Deep learning:methods and applications[J].Foundations and Trends in Signal Processing,2014,7(3/4):197-387.
[8]SCHMIDHUBER J.Deep learning in neural networks:an overview[J].Neural Networks,2015,61(1):85-117.
[9]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Advances in Neural Information Processing Systems,2012,25(1):1097-1105.
[10]KARPATHY A,TODERICI G,SHETTY S,et al.Large-Scale Video Classification with Convolutional Neural Networks[C]//IEEE Conference on Computer Vision and Pattern Recognition.Columbus,USA,2014:1725-1732.
[11]MATERZYNSKA J,XIAO T,HERZIG R,et al.Something-Else:Compositional Action Recognition With Spatial-Temporal Interaction Networks[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:1049-1059.
[12]SOOMRO K,ZAMIR A R,SHAH M.UCF101:A Dataset of 101 Human Actions Classes From Videos in The Wild[J].arXiv:1212.0402,2012.
[13]KAY W,CARREIRA J,SIMONYAN K,et al.The Kinetics Human Action Video Dataset[J].arXiv:1705.06950,2017.
[14]GU C,SUN C,ROSS D A,et al.AVA:A video dataset of spatio-temporally localized atomic visual actions[C]//IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:6047-6056.
[15]LI A,THOTAKURI M,ROSS D A,et al.The AVA-Kinetics Localized Human Actions Video Dataset[J].arXiv:2005.00214,2020.
[16]KUEHNE H,JHUANG H,GARROTE E,et al.HMDB:A Large Video Database for Human Motion Recognition[C]//IEEE International Conference on Computer Vision.Barcelona,Spain,2011:2556-2563.
[17]SIGURDSSON G A,VAROL G,WANG X,et al.Hollywood in homes:Crowdsourcing data collection for activity understanding[C]//European Conference on Computer Vision.Cham,Switzerland,2016:510-526.
[18]SIGURDSSON G A,GUPTA A,SCHMID C,et al.Charades-Ego:A large-scale dataset of paired third and first person videos[J].arXiv:1804.09626,2018.
[19]DAMEN D,DOUGHTY H,FARINELLA G M,et al.Rescaling Egocentric Vision:Collection,Pipeline and Challenges for EPIC-KITCHENS-100[J].International Journal of Computer Vision,2022,130(1):33-55.
[20]DAMEN D,DOUGHTY H,FARINELLA G M,et al.Scaling egocentric vision:The EPIC-KITCHENS dataset[C]//European Conference on Computer Vision.Munich,Germany,2018:720-736.
[21]DAMEN D,DOUGHTY H,FARINELLA G M,et al.The EPIC-KITCHENS dataset:Collection,challenges and baselines[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020:1-1.
[22]ABU-EL-HAIJA S,KOTHARI N,LEE J,et al.YouTube-8M:A large-scale video classification benchmark[J].arXiv:1609.08675,2016.
[23]ANTIPOV G,BERRANI S A,RUCHAUD N,et al.Learned vs. hand-crafted features for pedestrian gender recognition[C]//23rd ACM International Conference on Multimedia.New York,USA,2015:1263-1266.
[24]KLASER A,MARSZALEK M,SCHMID C.A spatio-temporal descriptor based on 3D-gradients[C]//19th British Machine Vision Conference.Leeds,UK,2008:1-10.
[25]WANG H,KLASER A,SCHMID C,et al.Action recognition by dense trajectories[C]//IEEE Conference on Computer Vision and Pattern Recognition.Colorado Springs,USA,2011:3169-3176.
[26]WANG H,SCHMID C.Action recognition with improved trajectories[C]//IEEE International Conference on Computer Vision.Sydney,Australia,2013:3551-3558.
[27]TRAN D,BOURDEV L,FERGUS R,et al.Learning Spatiotemporal Features with 3D Convolutional Networks[C]//IEEE International Conference on Computer Vision.Santiago,Chile,2015:4489-4497.
[28]HUANG K,DELANY S J,MCKEEVER S.Human Action Recognition in Videos Using Transfer Learning[C]//Irish Machine Vision and Image Processing Conference.Dublin,Ireland,2019.
[29]ZHANG Z,SEJDIC E.Radiological images and machine learning:trends,perspectives,and prospects[J].Computers in Biology and Medicine,2019,108(1):354-370.
[30]HINTON G E.Deep belief networks[J].Scholarpedia,2009,4(5):5947.
[31]TAYLOR G W,HINTON G E.Factored conditional restricted Boltzmann machines for modeling motion style[C]//The 26th Annual International Conference on Machine Learning.New York,USA,2009:1025-1032.
[32]LAROCHELLE H,BENGIO Y.Classification using discriminative restricted Boltzmann machines[C]//The 25th International Conference on Machine Learning.New York,USA,2008:536-543.
[33]CHEN B.Deep learning of invariant spatio-temporal features from video[D].British Columbia:University of British Columbia,2010.
[34]YANG T A,SILVER D L.The Disadvantage of CNN versus DBN Image Classification Under Adversarial Conditions[C]//The 34th Canadian Conference on Artificial Intelligence.Vancouver,Canada,2021.
[35]CHEN M,RADFORD A,CHILD R,et al.Generative pretraining from pixels[C]//International Conference on Machine Learning.Virtual,2020:1691-1703.
[36]SOCHER R,HUVAL B,BATH B,et al.Convolutional-recursive deep learning for 3D object classification[J].Advances in Neural Information Processing Systems,2012,25(1):656-664.
[37]VIJAYANARASIMHAN S,SHLENS J,MONGA R,et al.Deep networks with large output spaces[J].arXiv:1412.7479,2014.
[38]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[39]NG Y H,HAUSKNECHT M,VIJAYANARASIMHAN S,et al.Beyond short snippets:Deep networks for video classification[C]//IEEE Conference on Computer Vision and Pattern Recognition.Boston,USA,2015:4694-4702.
[40]DONAHUE J,HENDRICKS L A,ROHRBACH M,et al.Long-term recurrent convolutional networks for visual recognition and description[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(4):677-691.
[41]CARREIRA J,ZISSERMAN A.Quo vadis,action recognition? A new model and the Kinetics dataset[C]//IEEE Conference on Computer Vision and Pattern Recognition.Honolulu,USA,2017:6299-6308.
[42]SIMONYAN K,ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[J].Advances in Neural Information Processing Systems,2014,27:568-576.
[43]FEICHTENHOFER C,PINZ A,ZISSERMAN A.Convolutional two-stream network fusion for video action recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas,USA,2016:1933-1941.
[44]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014.
[45]LIN J,GAN C,HAN S.TSM:Temporal shift module for efficient video understanding[C]//IEEE/CVF International Conference on Computer Vision.Seoul,Korea,2019:7083-7093.
[46]JI S,XU W,YANG M,et al.3D Convolutional Neural Networks for Human Action Recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(1):221-231.
[47]FEICHTENHOFER C,FAN H,MALIK J,et al.SlowFast networks for video recognition[C]//IEEE International Conference on Computer Vision.Seoul,South Korea,2019:6202-6211.
[48]FEICHTENHOFER C.X3D:Expanding Architectures for Efficient Video Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:203-213.
[49]LEE Y,KIM H I,YUN K,et al.Diverse temporal aggregation and depthwise spatiotemporal factorization for efficient video classification[J].arXiv:2012.00317,2020.
[50]TRAN D,WANG H,TORRESANI L,et al.A Closer Look at Spatiotemporal Convolutions for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018.
[51]QIU Z,YAO T,MEI T.Learning spatio-temporal representation with pseudo-3D residual networks[C]//IEEE International Conference on Computer Vision.Venice,Italy,2017:5534-5542.
[52]WANG L,XIONG Y,WANG Z,et al.Temporal segment networks:Towards good practices for deep action recognition[C]//European Conference on Computer Vision.Amsterdam,Netherlands,2016:20-36.
[53]LIU Z,LUO D,WANG Y,et al.TEINet:Towards an Efficient Architecture for Video Recognition[C]//AAAI Conference on Artificial Intelligence.New York,USA,2020:11669-11676.
[54]LI Y,JI B,SHI X,et al.TEA:Temporal Excitation and Aggregation for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:909-918.
[55]LIU Z,WANG L,WU W,et al.TAM:Temporal adaptive module for video recognition[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:13708-13718.
[56]WANG L,TONG Z,JI B,et al.TDN:Temporal Difference Networks for Efficient Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2021:1895-1904.
[57]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,Volume 1(Long and Short Papers).Minneapolis,USA,2019:4171-4186.
[58]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All You Need[J].arXiv:1706.03762,2017.
[59]RUAN L,QIN J.Survey:Transformer Based Video-Language Pre-Training[J].arXiv:2109.09920,2021.
[60]GIRDHAR R,CARREIRA J,DOERSCH C,et al.Video action transformer network[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Long Beach,USA,2019:244-253.
[61]HARA K,KATAOKA H,SATOH Y.Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:6546-6555.
[62]PARK J,JEON S,KIM S,et al.Learning to detect,associate,and recognize human actions and surrounding scenes in untrimmed videos[C]//The 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild.Seoul,Korea,2018:21-26.
[63]SEONG H,HYUN J,KIM E.Video multitask transformer network[C]//IEEE/CVF International Conference on Computer Vision Workshops.Seoul,Korea,2019.
[64]BERTASIUS G,WANG H,TORRESANI L.Is Space-Time Attention All You Need for Video Understanding?[J].arXiv:2102.05095,2021.
[65]ARNAB A,DEHGHANI M,HEIGOLD G,et al.ViViT:A Video Vision Transformer[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:6836-6846.
[66]LIU Z,LIN Y,CAO Y,et al.Swin Transformer:Hierarchical Vision Transformer Using Shifted Windows[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:10012-10022.
[67]KONDRATYUK D,YUAN L,LI Y,et al.MoViNets:Mobile video networks for efficient video recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2021:16020-16030.
[68]KOOT R,HENNERBICHLER M,LU H.Evaluating Transformers for Lightweight Action Recognition[J].arXiv:2111.09641,2021.
[69]LANGERMAN D,JOHNSON A,BUETTNER K,et al.Beyond Floating-Point Ops:CNN Performance Prediction with Critical Datapath Length[C]//IEEE High Performance Extreme Computing Conference.Virtual,2020:1-9.