Computer Science (计算机科学), 2020, Vol. 47, Issue (6A): 139-147. doi: 10.11896/jsjkx.190900176

• Computer Graphics & Multimedia •

Review of Deep Learning-based Action Recognition Algorithms

HE Lei (赫磊), SHAO Zhan-peng (邵展鹏), ZHANG Jian-hua (张剑华) and ZHOU Xiao-long (周小龙)

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
  • Published: 2020-07-07
  • Corresponding author: SHAO Zhan-peng (zpshao@zjut.edu.cn)
  • Author e-mail: 1434347689@qq.com
  • About author: HE Lei, born in 1994, master. His main research interests include image processing and action recognition.
    SHAO Zhan-peng, Ph.D., is a member of China Computer Federation. His research interests include action recognition and pose estimation.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (20160283, 61603341) and the Natural Science Foundation of Zhejiang Province, China (KYY-ZX-20190013, KYY-ZX-20180114).

Abstract: Action recognition is one of the fundamental problems in computer vision, and deep learning-based methods are currently the mainstream approach to it. Traditional feature extraction methods rely on manual observation and design to hand-craft features that represent the actions in a video. However, building complex classification models on top of hand-crafted features can no longer meet the requirements of high recognition accuracy and practical applicability, and the introduction of deep learning has opened a new direction for action recognition. This paper reviews deep learning-based action recognition algorithms. First, the research background and significance of action recognition are introduced, and traditional methods and deep learning-based methods are surveyed in turn. Then, the deep learning model architectures are classified and introduced in three groups: Two-Stream networks, 3D-ConvNets, and fused CNN-LSTM networks. Finally, the commonly used public benchmark datasets are introduced, and recognition algorithms are compared side by side on two data modalities: RGB video (the UCF101 and HMDB51 datasets) and human skeleton sequences (the NTU RGB+D dataset). The comparison shows that deep learning methods have made great progress: the application of convolutional neural networks has greatly advanced action recognition, gradually replacing traditional methods based on hand-crafted features, and has markedly improved accuracy on action datasets. For RGB video, Two-Stream and 3D-ConvNet are the mainstream model architectures; for skeleton sequences, Two-Stream and fused spatiotemporal graph models are the mainstream.

Key words: Action recognition, Deep learning, Convolutional neural network, Recurrent neural network, 3D-ConvNet

CLC Number:

  • TP391.4