Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 211200025-10. DOI: 10.11896/jsjkx.211200025

• Image Processing & Multimedia Technology •


Survey of Deep Learning Networks for Video Recognition

QIAN Wen-xiang1,3, YI Yang1,2,3   

  1 School of Computer Science and Engineering,Sun Yat-sen University,Guangzhou 510275,China
    2 School of Information Science,Guangzhou Xinhua University,Guangzhou 510520,China
    3 Guangdong Key Laboratory of Big Data Analysis and Processing,Guangzhou 510275,China
  • Online:2022-11-10 Published:2022-11-21
  • Corresponding author:YI Yang(issyy@mail.sysu.edu.cn)
  • About author:QIAN Wen-xiang,born in 1992,postgraduate(qianwx3@mail2.sysu.edu.cn).His main research interests include human body recognition in natural scenes.
    YI Yang,born in 1967,Ph.D.,associate professor.Her main research interests include human body recognition in natural scenes.
  • Supported by:
    Guangzhou Science and Technology Project(202002030273,202102080656) and Key Discipline Project of Guangzhou Xinhua University(2020XZD02).


Abstract: Video recognition is one of the most important tasks in computer vision and has attracted wide attention from researchers.Video recognition refers to extracting key features from video clips,analyzing these features,and classifying the video accordingly.Unlike a single static image,the frames of a video clip are strongly correlated,and how to exploit feature information from the spatial and temporal dimensions efficiently and accurately is a central concern of current research.Taking video recognition technology as its subject,this paper first introduces the background of video recognition research and the most frequently used datasets.It then traces the evolution of video recognition methods.Classic methods based on spatio-temporal interest points,dense trajectories,and improved dense trajectories are reviewed,followed by the deep learning network frameworks for video recognition proposed in recent years,presented in chronological order and grouped by network architecture.Frameworks based on 2D convolutional neural networks are introduced first,including the two-stream convolutional network,the long short-term memory network,and the long-term recurrent convolutional network.Frameworks based on 3D convolutional neural networks are introduced next,including the SlowFast network and the X3D(eXpand 3D) network.Pseudo-3D convolutional neural networks follow,including the R(2+1)D network,the pseudo-3D residual network,and a set of lightweight networks built around temporal modeling.Finally,Transformer-based networks are introduced,including TimeSformer,the video vision Transformer(ViViT),and the shifted window Transformer(Swin Transformer).The evolution of these frameworks is described,and their implementation details and characteristics are summarized.The performance of each network on different video recognition datasets is evaluated,and the applicable scenarios of each network are analyzed.In the end,future research trends for video recognition network frameworks are discussed.Video recognition can automatically and efficiently identify the category to which a video belongs,and deep-learning-based video recognition has a wide range of practical value.
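To make the architectural contrast above concrete, the following sketch illustrates the (2+1)D factorization used by pseudo-3D networks such as R(2+1)D: a full t x k x k 3D convolution is replaced by a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal convolution. This is a minimal PyTorch sketch under stated assumptions only; the module name, layer widths, and demo input are illustrative and not taken from any cited implementation.

import torch
import torch.nn as nn

class R2Plus1dBlock(nn.Module):
    """Illustrative (2+1)D block: a 2D spatial conv followed by a 1D temporal conv.

    Approximates a full t x k x k 3D convolution with a similar parameter
    budget but an extra nonlinearity between the two steps, the core idea
    behind R(2+1)D and P3D. A hypothetical sketch, not a reference implementation.
    """
    def __init__(self, in_ch, out_ch, k=3, t=3, mid_ch=None):
        super().__init__()
        # Pick the intermediate width so the factorized block roughly matches
        # the parameter count of the t x k x k 3D conv it replaces.
        if mid_ch is None:
            mid_ch = (t * k * k * in_ch * out_ch) // (k * k * in_ch + t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn1 = nn.BatchNorm3d(mid_ch)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.spatial(x)))      # 2D conv over H,W in each frame
        return self.relu(self.bn2(self.temporal(x)))  # 1D conv over the frame axis

# Usage on a dummy 8-frame, 112 x 112 clip:
clip = torch.randn(2, 3, 8, 112, 112)
block = R2Plus1dBlock(3, 64)
print(block(clip).shape)  # torch.Size([2, 64, 8, 112, 112])

The factorization keeps the receptive field of the 3D convolution while doubling the number of nonlinearities per block, which the R(2+1)D authors report makes optimization easier; the same serial decomposition underlies one of the P3D variants.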

Key words: Video recognition, Improved dense trajectory, Deep learning, Two-stream network, Convolutional neural network, Deep self-attention network

CLC number:

  • TP183

References
[1]DALAL N,TRIGGS B.Histograms of oriented gradients for human detection[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Diego,USA,2005:886-893.
[2]CHAUDHRY R,RAVICHANDRAN A,HAGER G,et al.Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions[C]//IEEE Conference on Computer Vision and Pattern Recognition.Miami,USA,2009:1932-1939.
[3]WANG H,KLASER A,SCHMID C,et al.Dense Trajectories and Motion Boundary Descriptors for Action Recognition[J].International Journal of Computer Vision,2013,103(1):61-79.
[4]LAZEBNIK S,SCHMID C,PONCE J.Beyond Bags of Features:Spatial Pyramid Matching for Recognizing Natural Scene Categories[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition.New York,USA,2006:2169-2178.
[5]YANG M,ZHANG L,FENG X,et al.Sparse representation based Fisher discrimination dictionary learning for image classification[J].International Journal of Computer Vision,2014,109(3):209-232.
[6]HINTON G E.Learning multiple layers of representation[J].Trends in Cognitive Sciences,2007,11(10):428-434.
[7]DENG L,YU D.Deep learning:methods and applications[J].Foundations and Trends in Signal Processing,2014,7(3/4):197-387.
[8]SCHMIDHUBER J.Deep learning in neural networks:an overview[J].Neural Networks,2015,61(1):85-117.
[9]KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Advances in Neural Information Processing Systems,2012,25(1):1097-1105.
[10]KARPATHY A,TODERICI G,SHETTY S,et al.Large-Scale Video Classification with Convolutional Neural Networks[C]//IEEE Conference on Computer Vision and Pattern Recognition.Columbus,USA,2014:1725-1732.
[11]MATERZYNSKA J,XIAO T,HERZIG R,et al.Something-Else:Compositional Action Recognition With Spatial-Temporal Interaction Networks[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:1049-1059.
[12]SOOMRO K,ZAMIR A R,SHAH M.UCF101:A Dataset of 101 Human Action Classes From Videos in The Wild[J].arXiv:1212.0402,2012.
[13]KAY W,CARREIRA J,SIMONYAN K,et al.The Kinetics Human Action Video Dataset[J].arXiv:1705.06950,2017.
[14]GU C,SUN C,ROSS D A,et al.AVA:A video dataset of spatio-temporally localized atomic visual actions[C]//IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:6047-6056.
[15]LI A,THOTAKURI M,ROSS D A,et al.The AVA-Kinetics Localized Human Actions Video Dataset[J].arXiv:2005.00214,2020.
[16]KUEHNE H,JHUANG H,GARROTE E,et al.HMDB:A Large Video Database for Human Motion Recognition[C]//IEEE International Conference on Computer Vision.Barcelona,Spain,2011:2556-2563.
[17]SIGURDSSON G A,VAROL G,WANG X,et al.Hollywood in homes:Crowdsourcing data collection for activity understanding[C]//European Conference on Computer Vision.Amsterdam,Netherlands,2016:510-526.
[18]SIGURDSSON G A,GUPTA A,SCHMID C,et al.Charades-ego:A large-scale dataset of paired third and first person videos[J].arXiv:1804.09626,2018.
[19]DAMEN D,DOUGHTY H,FARINELLA G M,et al.Rescaling Egocentric Vision:Collection,Pipeline and Challenges for EPIC-KITCHENS-100[J].International Journal of Computer Vision,2022,130(1):33-55.
[20]DAMEN D,DOUGHTY H,FARINELLA G M,et al.Scaling egocentric vision:The EPIC-KITCHENS dataset[C]//The European Conference on Computer Vision.Munich,Germany,2018:720-736.
[21]DAMEN D,DOUGHTY H,FARINELLA G M,et al.The EPIC-KITCHENS dataset:Collection,challenges and baselines[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2020(1):1-1.
[22]ABU-EL-HAIJA S,KOTHARI N,LEE J,et al.YouTube-8M:A large-scale video classification benchmark[J].arXiv:1609.08675,2016.
[23]ANTIPOV G,BERRANI S A,RUCHAUD N,et al.Learned vs. hand-crafted features for pedestrian gender recognition[C]//23rd ACM International Conference on Multimedia.New York,USA,2015:1263-1266.
[24]KLASER A,MARSZALEK M,SCHMID C.A spatio-temporal descriptor based on 3D-gradients[C]//19th British Machine Vision Conference.Leeds,UK,2008:1-10.
[25]WANG H,KLASER A,SCHMID C,et al.Action recognition by dense trajectories[C]//2011 IEEE Conference on Computer Vision and Pattern Recognition.Colorado Springs,USA,2011:3169-3176.
[26]WANG H,SCHMID C.Action recognition with improved trajectories[C]//2013 IEEE International Conference on Computer Vision.Sydney,Australia,2013:3551-3558.
[27]TRAN D,BOURDEV L,FERGUS R,et al.Learning Spatiotemporal Features with 3D Convolutional Networks[C]//IEEE International Conference on Computer Vision.Santiago,Chile,2015:4489-4497.
[28]HUANG K,DELANY S J,MCKEEVER S.Human Action Recognition in Videos Using Transfer Learning[C]//Irish Machine Vision and Image Processing Conference.Dublin,Ireland,2019.
[29]ZHANG Z,SEJDIC E.Radiological images and machine learning:trends,perspectives,and prospects[J].Computers in Biology and Medicine,2019,108(1):354-370.
[30]HINTON G E.Deep belief networks[J].Scholarpedia,2009,4(5):5947.
[31]TAYLOR G W,HINTON G E.Factored conditional restricted Boltzmann machines for modeling motion style[C]//The 26th Annual International Conference on Machine Learning.New York,USA,2009:1025-1032.
[32]LAROCHELLE H,BENGIO Y.Classification using discriminative restricted Boltzmann machines[C]//The 25th International Conference on Machine Learning.New York,USA,2008:536-543.
[33]CHEN B.Deep learning of invariant spatio-temporal features from video[D].British Columbia:University of British Columbia,2010.
[34]YANG T A,SILVER D L.The Disadvantage of CNN versus DBN Image Classification Under Adversarial Conditions[C]//The 34th Canadian Conference on Artificial Intelligence.Vancouver,Canada,2021.
[35]CHEN M,RADFORD A,CHILD R,et al.Generative pretraining from pixels[C]//International Conference on Machine Learning.Virtual,2020:1691-1703.
[36]SOCHER R,HUVAL B,BATH B,et al.Convolutional-recursive deep learning for 3D object classification[J].Advances in Neural Information Processing Systems,2012,25(1):656-664.
[37]VIJAYANARASIMHAN S,SHLENS J,MONGA R,et al.Deep networks with large output spaces[J].arXiv:1412.7479,2014.
[38]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[39]NG Y H,HAUSKNECHT M,VIJAYANARASIMHAN S,et al.Beyond short snippets:Deep networks for video classification[C]//IEEE Conference on Computer Vision and Pattern Recognition.Boston,USA,2015:4694-4702.
[40]DONAHUE J,HENDRICKS L A,ROHRBACH M,et al.Long-term recurrent convolutional networks for visual recognition and description[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2017,39(4):677-691.
[41]CARREIRA J,ZISSERMAN A.Quo vadis,action recognition?A new model and the Kinetics dataset[C]//IEEE Conference on Computer Vision and Pattern Recognition.Honolulu,USA,2017:6299-6308.
[42]SIMONYAN K,ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[J].Advances in Neural Information Processing Systems,2014,27:568-576.
[43]FEICHTENHOFER C,PINZ A,ZISSERMAN A.Convolutional two-stream network fusion for video action recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas,USA,2016:1933-1941.
[44]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].arXiv:1409.1556,2014.
[45]LIN J,GAN C,HAN S.TSM:Temporal shift module for efficient video understanding[C]//IEEE/CVF International Conference on Computer Vision.Seoul,Korea,2019:7083-7093.
[46]JI S,XU W,YANG M,et al.3D Convolutional Neural Networks for Human Action Recognition[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2013,35(1):221-231.
[47]FEICHTENHOFER C,FAN H,MALIK J,et al.SlowFast networks for video recognition[C]//IEEE International Conference on Computer Vision.Seoul,Korea,2019:6202-6211.
[48]FEICHTENHOFER C.X3D:Expanding Architectures for Efficient Video Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:203-213.
[49]LEE Y,KIM H I,YUN K,et al.Diverse temporal aggregation and depthwise spatiotemporal factorization for efficient video classification[J].arXiv:2012.00317,2020.
[50]TRAN D,WANG H,TORRESANI L,et al.A Closer Look at Spatiotemporal Convolutions for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018.
[51]QIU Z,YAO T,MEI T.Learning spatio-temporal representation with pseudo-3D residual networks[C]//IEEE International Conference on Computer Vision.Venice,Italy,2017:5534-5542.
[52]WANG L,XIONG Y,WANG Z,et al.Temporal segment networks:Towards good practices for deep action recognition[C]//European Conference on Computer Vision.Amsterdam,Netherlands,2016:20-36.
[53]LIU Z,LUO D,WANG Y,et al.TEINet:Towards an Efficient Architecture for Video Recognition[C]//AAAI Conference on Artificial Intelligence.New York,USA,2020:11669-11676.
[54]LI Y,JI B,SHI X,et al.TEA:Temporal Excitation and Aggregation for Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2020:909-918.
[55]LIU Z,WANG L,WU W,et al.TAM:Temporal adaptive module for video recognition[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:13708-13718.
[56]WANG L,TONG Z,JI B,et al.TDN:Temporal Difference Networks for Efficient Action Recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2021:1895-1904.
[57]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,Volume 1(Long and Short Papers).Minneapolis,USA,2019:4171-4186.
[58]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All You Need[J].arXiv:1706.03762,2017.
[59]RUAN L,QIN J.Survey:Transformer Based Video-Language Pre-Training[J].arXiv:2109.09920,2021.
[60]GIRDHAR R,CARREIRA J,DOERSCH C,et al.Video action transformer network[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Long Beach,USA,2019:244-253.
[61]HARA K,KATAOKA H,SATOH Y.Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:6546-6555.
[62]PARK J,JEON S,KIM S,et al.Learning to detect,associate,and recognize human actions and surrounding scenes in untrimmed videos[C]//The 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild.Seoul,Korea,2018:21-26.
[63]SEONG H,HYUN J,KIM E.Video multitask transformer network[C]//IEEE/CVF International Conference on Computer Vision Workshops.Seoul,Korea,2019.
[64]BERTASIUS G,WANG H,TORRESANI L.Is Space-Time Attention All You Need for Video Understanding?[J].arXiv:2102.05095,2021.
[65]ARNAB A,DEHGHANI M,HEIGOLD G,et al.ViViT:A Video Vision Transformer[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:6836-6846.
[66]LIU Z,LIN Y,CAO Y,et al.Swin Transformer:Hierarchical Vision Transformer Using Shifted Windows[C]//IEEE/CVF International Conference on Computer Vision.Virtual,2021:10012-10022.
[67]KONDRATYUK D,YUAN L,LI Y,et al.MoViNets:Mobile video networks for efficient video recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Virtual,2021:16020-16030.
[68]KOOT R,HENNERBICHLER M,LU H.Evaluating Transformers for Lightweight Action Recognition[J].arXiv:2111.09641,2021.
[69]LANGERMAN D,JOHNSON A,BUETTNER K,et al.Beyond Floating-Point Ops:CNN Performance Prediction with Critical Datapath Length[C]//IEEE High Performance Extreme Computing Conference.Virtual,2020:1-9.