Computer Science (计算机科学), 2020, Vol. 47, Issue 4: 85-93. doi: 10.11896/jsjkx.190300005

• Computer Graphics & Multimedia •

Survey on Human Action Recognition Based on Deep Learning

CAI Qiang1,2, DENG Yi-biao1,2, LI Hai-sheng1,2, YU Le1,2, MING Shao-feng1

  1 School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048, China
  2 Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing 100048, China
  • Received: 2019-03-06 Online: 2020-04-15 Published: 2020-04-15
  • Contact: DENG Yi-biao (dyb9714@sina.com), born in 1994, postgraduate. His main research interests include computer vision and human action recognition.
  • About author: CAI Qiang, born in 1969, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include computer graphics, scientific visualization and intelligent information processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61877002), the National Social Science Fund of China (18BJL202), the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (17YJCZH127), the Beijing Municipal Commission of Education, China (PXM2019_014213_000007), and the Science and Technology Program of Beijing, China (Z161100_001616004).

Abstract: Human action recognition is an active research topic in computer vision, with broad application prospects in areas such as intelligent surveillance, smart homes, and virtual reality, and it has attracted wide attention from researchers in China and abroad. Methods based on traditional handcrafted features struggle to recognize human actions in complex scenes. Following the success of deep learning in image classification, applying deep learning to human action recognition has become a clear trend, although difficulties and challenges remain. This paper first briefly reviews early action recognition methods based on traditional handcrafted features, grouped by feature extraction approach. It then discusses and analyzes recent deep learning-based methods from the perspective of network architecture, including the widely used two-stream and 3D convolutional network architectures. It also introduces the human action recognition datasets currently used to evaluate method performance, and summarizes the performance of several representative methods on two well-known public datasets, UCF-101 and HMDB-51. Finally, it discusses future directions for deep learning-based human action recognition in terms of performance and application, and points out the shortcomings of current methods.

Key words: Convolutional neural network, Deep learning, Human action recognition, Human action recognition dataset

CLC Number: TP391