Computer Science (计算机科学) ›› 2022, Vol. 49 ›› Issue (11A): 211200025-10. doi: 10.11896/jsjkx.211200025
QIAN Wen-xiang1,3, YI Yang1,2,3
Abstract: Video recognition is one of the most important tasks in computer vision and has attracted wide attention from researchers. Video recognition refers to extracting features from video clips and identifying the actions in a video based on those features. Compared with static images, the frames of a video are strongly correlated, and how to efficiently exploit feature information from different dimensions, such as space and time, to recognize videos accurately is the focus of current research. Taking video recognition technology as its object of study, this paper first introduces the background of video recognition research and the commonly used datasets. It then traces the evolution of video recognition methods in detail, reviewing traditional approaches based on space-time interest points, dense trajectories, and improved dense trajectories, as well as deep learning frameworks proposed in recent years for video recognition. Specifically, it covers video recognition frameworks based on 2D convolutional neural networks, 3D convolutional neural networks, pseudo-3D convolutional neural networks, and Transformer-based networks, describes how these frameworks evolved, and summarizes their implementation details and characteristics. The performance of each network on different video recognition datasets is evaluated, and the scenarios to which each network is suited are analyzed. Finally, future research trends for video recognition network frameworks are discussed. Video recognition can automatically and efficiently identify the category a video belongs to, and deep-learning-based video recognition has broad practical value.
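The pseudo-3D idea mentioned above, factorizing a full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, can be illustrated with a small parameter-count sketch. This is a minimal illustration, not any surveyed network's actual implementation: the function names and the intermediate channel width `mid` are assumptions made here for clarity (P3D- and R(2+1)D-style networks choose the intermediate width by their own rules).

```python
# Parameter-count sketch contrasting a full 3D convolution with a
# pseudo-3D ("2D spatial + 1D temporal") factorization.
# All names below are illustrative; biases are ignored throughout.

def conv3d_params(c_in: int, c_out: int, t: int, k: int) -> int:
    """Weight count of one full t x k x k 3D convolution."""
    return c_in * c_out * t * k * k

def pseudo3d_params(c_in: int, c_out: int, t: int, k: int, mid: int) -> int:
    """Weight count after factorizing into a 1 x k x k spatial
    convolution (c_in -> mid) followed by a t x 1 x 1 temporal
    convolution (mid -> c_out)."""
    spatial = c_in * mid * k * k    # 2D convolution over H x W
    temporal = mid * c_out * t      # 1D convolution over time
    return spatial + temporal

if __name__ == "__main__":
    # A typical mid-network block: 64 -> 64 channels, 3 x 3 x 3 kernel.
    full = conv3d_params(64, 64, t=3, k=3)                # 110592
    factored = pseudo3d_params(64, 64, t=3, k=3, mid=64)  # 49152
    print(full, factored)
```

With these illustrative numbers the factorized block uses fewer than half the weights of the full 3D convolution, which is one reason pseudo-3D designs reduce computation while still modeling spatial and temporal structure.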