Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 211200025-10. DOI: 10.11896/jsjkx.211200025

• Image Processing & Multimedia Technology •


Survey of Deep Learning Networks for Video Recognition

QIAN Wen-xiang1,3, YI Yang1,2,3   

  1 School of Computer Science and Engineering,Sun Yat-sen University,Guangzhou 510275,China
    2 School of Information Science,Guangzhou Xinhua University,Guangzhou 510520,China
    3 Guangdong Key Laboratory of Big Data Analysis and Processing,Guangzhou 510275,China
  • Online:2022-11-10 Published:2022-11-21
  • Corresponding author:YI Yang(issyy@mail.sysu.edu.cn)
  • About author:QIAN Wen-xiang,born in 1992,postgraduate(qianwx3@mail2.sysu.edu.cn).His main research interests include human body recognition in natural scenes.
    YI Yang,born in 1967,Ph.D.,associate professor.Her main research interests include human body recognition in natural scenes.
  • Supported by:
    Guangzhou Science and Technology Project(202002030273,202102080656) and Key Discipline Project of Guangzhou Xinhua University(2020XZD02).


Abstract: Video recognition is one of the most important tasks in computer vision and has attracted wide attention from researchers.Video recognition refers to extracting key features from video clips,analyzing these features,and classifying the video accordingly.Unlike a single static image,the frames of a video clip are strongly correlated,and how to exploit feature information from the spatial and temporal dimensions efficiently and accurately is a central concern of current research.Taking video recognition technology as its subject,this paper first introduces the background of video recognition research and the most frequently used datasets.It then traces the evolution of video recognition methods.Classic methods based on spatio-temporal interest points,dense trajectories,and improved dense trajectories are reviewed,followed by the deep learning network frameworks for video recognition proposed in recent years,presented in chronological order and grouped by network architecture.Frameworks based on 2D convolutional neural networks are introduced first,including the two-stream convolutional network,the long short-term memory network,and the long-term recurrent convolutional network.Frameworks based on 3D convolutional neural networks are introduced next,including the SlowFast network and the X3D(eXpand 3D) network.Pseudo-3D convolutional neural networks follow,including the R(2+1)D network,the pseudo-3D residual network,and a set of lightweight networks built around temporal modeling.Finally,Transformer-based networks are introduced,including TimeSformer,the video vision Transformer(ViViT),and the shifted window Transformer(Swin Transformer).The evolution of these frameworks is described,and their implementation details and characteristics are summarized.The performance of each network on different video recognition datasets is evaluated,and the applicable scenarios of each network are analyzed.In the end,future research trends for video recognition network frameworks are discussed.Video recognition can automatically and efficiently identify the category to which a video belongs,and deep-learning-based video recognition has a wide range of practical value.
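To make the architectural contrast above concrete, the following sketch illustrates the (2+1)D factorization used by pseudo-3D networks such as R(2+1)D: a full t x k x k 3D convolution is replaced by a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal convolution. This is a minimal PyTorch sketch under stated assumptions only; the module name, layer widths, and demo input are illustrative and not taken from any cited implementation.

import torch
import torch.nn as nn

class R2Plus1dBlock(nn.Module):
    """Illustrative (2+1)D block: a 2D spatial conv followed by a 1D temporal conv.

    Approximates a full t x k x k 3D convolution with a similar parameter
    budget but an extra nonlinearity between the two steps, the core idea
    behind R(2+1)D and P3D. A hypothetical sketch, not a reference implementation.
    """
    def __init__(self, in_ch, out_ch, k=3, t=3, mid_ch=None):
        super().__init__()
        # Pick the intermediate width so the factorized block roughly matches
        # the parameter count of the t x k x k 3D conv it replaces.
        if mid_ch is None:
            mid_ch = (t * k * k * in_ch * out_ch) // (k * k * in_ch + t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn1 = nn.BatchNorm3d(mid_ch)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.spatial(x)))      # 2D conv over H,W in each frame
        return self.relu(self.bn2(self.temporal(x)))  # 1D conv over the frame axis

# Usage on a dummy 8-frame, 112 x 112 clip:
clip = torch.randn(2, 3, 8, 112, 112)
block = R2Plus1dBlock(3, 64)
print(block(clip).shape)  # torch.Size([2, 64, 8, 112, 112])

The factorization keeps the receptive field of the 3D convolution while doubling the number of nonlinearities per block, which the R(2+1)D authors report makes optimization easier; the same serial decomposition underlies one of the P3D variants.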

Key words: Video recognition, Improved dense trajectory, Deep learning, Two-stream network, Convolutional neural network, Deep self-attention network

CLC number:

  • TP183

References
[1]DALAL N,TRIGGS B.Histograms of oriented gradients for human detection[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Diego,USA,2005:886-893.
[2]CHAUDHRY R,RAVICHANDRAN A,HAGER G,et al.Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions[C]//IEEE Conference on Computer Vision and Pattern Recognition.Miami,USA,2009:1932-1939.
[3]WANG H,KLASER A,SCHMID C,et al.Dense Trajectories and Motion Boundary Descriptors for Action Recognition[J].International Journal of Computer Vision,2013,103(1):61-79.
[4]LAZEBNIK S,SCHMID C,PONCE J.Beyond Bags of Features:Spatial Pyramid Matching for Recognizing Natural Scene Categories[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition.New York,USA,2006:2169-2178.
[5]YANG M,ZHANG L,FENG X,et al.Sparse representation based Fisher discrimination dictionary learning for image classification[J].International Journal of Computer Vision,2014,109(3):209-232.
[6]HINTON G E.Learning multiple layers of representation[J].Trends in Cognitive Sciences,2007,11(10):428-434.
[7]DENG L,YU D.Deep learning:methods and applications[J].Foundations and Trends in Signal Processing,2014,7(3/4):197-387.
[8]SCHMIDHUBER J.Deep learning in neural networks:an overview[J].Neural Networks,2015,61(1):85-117.
[9]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Advances in Neural Information Processing Systems,2012,25(1):1097-1105.
[10]KARPATHY A,TODERICI G,SHETTY S,et al.Large-Scale Video Classification with Convolutional Neural Networks[C]//IEEE Conference on Computer Vision and Pattern Recognition.Columbus,USA,2014:1725-1732.
[11]MATERZYNSKA J,XIAO T,HERZIG R,et al.Something-Else:Compositional Action Recognition With Spatial-Temporal Interaction Networks[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:1049-1059.
[12]SOOMRO K,ZAMIR A R,SHAH M.UCF101:A Dataset of 101 Human Action Classes From Videos in The Wild[J].arXiv:1212.0402,2012.
[13]KAY W,CARREIRA J,SIMONYAN K,et al.The Kinetics Human Action Video Dataset[J].arXiv:1705.06950,2017.
[14]GU C,SUN C,ROSS D A,et al.AVA:A video dataset of spatio-temporally localized atomic visual actions[C]//IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:6047-6056.
[15]LI A,THOTAKURI M,ROSS D A,et al.The AVA-Kinetics Localized Human Actions Video Dataset[J].arXiv:2005.00214,2020.
[16]KUEHNE H,JHUANG H,GARROTE E,et al.HMDB:A Large Video Database for Human Motion Recognition[C]//IEEE International Conference on Computer Vision.Barcelona,Spain,2011:2556-2563.
[17]SIGURDSSON G A,VAROL G,WANG X,et al.Hollywood in homes:Crowdsourcing data collection for activity understanding[C]//European Conference on Computer Vision.Amsterdam,Netherlands,2016:510-526.
[18]SIGURDSSON G A,GUPTA A,SCHMID C,et al.Charades-ego:A large-scale dataset of paired third and first person videos[J].arXiv:1804.09626,2018.
[19]DAMEN D,DOUGHTY H,FARINELLA G M,et al.Rescaling Egocentric Vision:Collection,Pipeline and Challenges for EPIC-KITCHENS-100[J].International Journal of Computer Vision,2022,130(1):33-55.
[20]DAMEN D,DOUGHTY H,FARINELLA G M,et al.Scaling egocentric vision:The EPIC-KITCHENS dataset[C]//The European Conference on Computer Vision.Munich,Germany,2018:720-736.
[21]DAMEN D,DOUGHTY H,FARINELLA G M,et al.The EPIC-KITCHENS dataset:Collection,challenges and baselines[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2020(1):1-1.
[22]ABU-EL-HAIJA S,KOTHARI N,LEE J,et al.YouTube-8M:A large-scale video classification benchmark[J].arXiv:1609.08675,2016.
[23]ANTIPOV G,BERRANI S A,RUCHAUD N,et al.Learned vs. hand-crafted features for pedestrian gender recognition[C]//23rd ACM International Conference on Multimedia.New York,USA,2015:1263-1266.
[24]KLASER A,MARSZALEK M,SCHMID C.A spatio-temporal descriptor based on 3D-gradients[C]//19th British Machine Vision Conference.Leeds,UK,2008:1-10.
[25]WANG H,KLASER A,SCHMID C,et al.Action recognition by dense trajectories[C]//2011 IEEE Conference on Computer Vision and Pattern Recognition.Colorado Springs,USA,2011:3169-3176.
[26]WANG H,SCHMID C.Action recognition with improved trajectories[C]//2013 IEEE International Conference on Computer Vision.Sydney,Australia,2013:3551-3558.
[27]TRAN D,BOURDEV L,FERGUS R,et al.Learning Spatiotemporal Features with 3D Convolutional Networks[C]//IEEE International Conference on Computer Vision.Santiago,Chile,2015:4489-4497.
[28]HUANG K,DELANY S J,MCKEEVER S.Human Action Recognition in Videos Using Transfer Learning[C]//Irish Machine Vision and Image Processing Conference.Dublin,Ireland,2019.
[29]ZHANG Z,SEJDIC E.Radiological images and machine learning:trends,perspectives,and prospects[J].Computers in Biology and Medicine,2019,108(1):354-370.
[30]HINTON G E.Deep belief networks[J].Scholarpedia,2009,4(5):5947.
[31]TAYLOR G W,HINTON G E.Factored conditional restricted Boltzmann machines for modeling motion style[C]//The 26th Annual International Conference on Machine Learning.New York,USA,2009:1025-1032.
[32]LAROCHELLE H,BENGIO Y.Classification using discriminative restricted Boltzmann machines[C]//The 25th International Conference on Machine Learning.New York,USA,2008:536-543.
[33]CHEN B.Deep learning of invariant spatio-temporal features from video[D].British Columbia:University of British Columbia,2010.
[34]YANG T A,SILVER D L.The Disadvantage of CNN versus DBN Image Classification Under Adversarial Conditions[C]//The 34th Canadian Conference on Artificial Intelligence.Vancouver,Canada,2021.
[35]CHEN M,RADFORD A,CHILD R,et al.Generative pretraining from pixels[C]//International Conference on Machine Learning.Virtual,2020:1691-1703.
[36]SOCHER R,HUVAL B,BATH B,et al.Convolutional-recursive deep learning for 3D object classification[J].Advances in Neural Information Processing Systems,2012,25(1):656-664.
[37]VIJAYANARASIMHAN S,SHLENS J,MONGA R,et al.Deep networks with large output spaces[J].arXiv:1412.7479,2014.
[38]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[39]NG Y H,HAUSKNECHT M,VIJAYANARASIMHAN S,et al.Beyond short snippets:Deep networks for video classification[C]//IEEE Conference on Computer Vision and Pattern Recognition.Boston,USA,2015:4694-4702.
[40]DONAHUE J,HENDRICKS L A,ROHRBACH M,et al.Long-term recurrent convolutional networks for visual recognition and description[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2017,39(4):677-691.
[41]CARREIRA J,ZISSERMAN A.Quo vadis,action recognition?A new model and the Kinetics dataset[C]//IEEE Conference on Computer Vision and Pattern Recognition.Honolulu,USA,2017:6299-6308.
[42]SIMONYAN K,ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[J].Advances in Neural Information Processing Systems,2014,27:568-576.
[43]FEICHTENHOFER C,PINZ A,ZISSERMAN A.Convolutional two-stream network fusion for video action recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas,USA,2016:1933-1941.
[44]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014.
[45]LIN J,GAN C,HAN S.TSM:Temporal shift module for efficient video understanding[C]//IEEE/CVF International Conference on Computer Vision.Seoul,Korea,2019:7083-7093.
[46]JI S,XU W,YANG M,et al.3D Convolutional Neural Networks for Human Action Recognition[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2013,35(1):221-231.
[47]FEICHTENHOFER C,FAN H,MALIK J,et al.SlowFast networks for video recognition[C]//IEEE International Conference on Computer Vision.Seoul,Korea,2019:6202-6211.
[48]FEICHTENHOFER C.X3D:Expanding Architectures for Efficient Video Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:203-213.
[49]LEE Y,KIM H I,YUN K,et al.Diverse temporal aggregation and depthwise spatiotemporal factorization for efficient video classification[J].arXiv:2012.00317,2020.
[50]TRAN D,WANG H,TORRESANI L,et al.A Closer Look at Spatiotemporal Convolutions for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018.
[51]QIU Z,YAO T,MEI T.Learning spatio-temporal representation with pseudo-3D residual networks[C]//IEEE International Conference on Computer Vision.Venice,Italy,2017:5534-5542.
[52]WANG L,XIONG Y,WANG Z,et al.Temporal segment networks:Towards good practices for deep action recognition[C]//European Conference on Computer Vision.Amsterdam,Netherlands,2016:20-36.
[53]LIU Z,LUO D,WANG Y,et al.TEINet:Towards an Efficient Architecture for Video Recognition[C]//AAAI Conference on Artificial Intelligence.New York,USA,2020:11669-11676.
[54]LI Y,JI B,SHI X,et al.TEA:Temporal Excitation and Aggregation for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:909-918.
[55]LIU Z,WANG L,WU W,et al.TAM:Temporal adaptive module for video recognition[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:13708-13718.
[56]WANG L,TONG Z,JI B,et al.TDN:Temporal Difference Networks for Efficient Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2021:1895-1904.
[57]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,Volume 1(Long and Short Papers).Minneapolis,USA,2019:4171-4186.
[58]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All You Need[J].arXiv:1706.03762,2017.
[59]RUAN L,QIN J.Survey:Transformer Based Video-Language Pre-Training[J].arXiv:2109.09920,2021.
[60]GIRDHAR R,CARREIRA J,DOERSCH C,et al.Video action transformer network[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Long Beach,USA,2019:244-253.
[61]HARA K,KATAOKA H,SATOH Y.Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:6546-6555.
[62]PARK J,JEON S,KIM S,et al.Learning to detect,associate,and recognize human actions and surrounding scenes in untrimmed videos[C]//The 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild.Seoul,Korea,2018:21-26.
[63]SEONG H,HYUN J,KIM E.Video multitask transformer network[C]//IEEE/CVF International Conference on Computer Vision Workshops.Seoul,Korea,2019.
[64]BERTASIUS G,WANG H,TORRESANI L.Is Space-Time Attention All You Need for Video Understanding?[J].arXiv:2102.05095,2021.
[65]ARNAB A,DEHGHANI M,HEIGOLD G,et al.ViViT:A Video Vision Transformer[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:6836-6846.
[66]LIU Z,LIN Y,CAO Y,et al.Swin Transformer:Hierarchical Vision Transformer Using Shifted Windows[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:10012-10022.
[67]KONDRATYUK D,YUAN L,LI Y,et al.MoViNets:Mobile video networks for efficient video recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2021:16020-16030.
[68]KOOT R,HENNERBICHLER M,LU H.Evaluating Transformers for Lightweight Action Recognition[J].arXiv:2111.09641,2021.
[69]LANGERMAN D,JOHNSON A,BUETTNER K,et al.Beyond Floating-Point Ops:CNN Performance Prediction with Critical Datapath Length[C]//IEEE High Performance Extreme Computing Conference.Virtual,2020:1-9.