Computer Science, 2020, Vol. 47, Issue (4): 85-93. doi: 10.11896/jsjkx.190300005

• Computer Graphics & Multimedia •

Survey on Human Action Recognition Based on Deep Learning

CAI Qiang1,2, DENG Yi-biao1,2, LI Hai-sheng1,2, YU Le1,2, MING Shao-feng1   

  1. School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048, China
  2. Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing 100048, China
  • Received: 2019-03-06 Online: 2020-04-15 Published: 2020-04-15
  • Contact: DENG Yi-biao, born in 1994, postgraduate. His main research interests include computer vision and human action recognition.
  • About author: CAI Qiang, born in 1969, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include computer graphics, scientific visualization and intelligent information processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61877002), the National Social Science Fund of China (18BJL202), the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (17YJCZH127), the Beijing Municipal Commission of Education, China (PXM2019 014213 000007) and the Science and Technology Program of Beijing, China (Z161100_001616004).

Abstract: Human action recognition is an active research topic in computer vision, with broad application prospects in fields such as intelligent surveillance, smart homes and virtual reality, and it has attracted attention from researchers in China and abroad. Methods based on traditional handcrafted features struggle to recognize human actions in complex scenarios. Following the success of deep learning in image classification, applying deep learning to human action recognition has become a clear trend, although difficulties and challenges remain. This paper first gives a brief overview of the early handcrafted-representation-based methods for human action recognition, grouped by their feature extraction approach. It then discusses and analyzes deep learning-based approaches from the perspective of network architecture, including Two-Stream Networks and 3D Convolutional Networks, among others. It also introduces the datasets currently used to evaluate human action recognition methods, and summarizes the performance of some typical methods on two well-known public datasets, UCF-101 and HMDB-51. Finally, it discusses future trends of deep learning-based methods in terms of performance and application, and points out their shortcomings.

Key words: Convolutional neural network, Deep learning, Human action recognition, Human action recognition dataset

CLC Number: TP391