Computer Science, 2025, Vol. 52, Issue (10): 106-114. doi: 10.11896/jsjkx.240800108

• Computer Graphics & Multimedia •

Spatial-Temporal Joint Mapping for Skeleton-based Action Recognition

ZHAO Chen, PENG Jian, HUANG Junhao

  1. College of Computer Science, Sichuan University, Chengdu 610065, China
  • Received: 2024-08-21 Revised: 2024-11-24 Online: 2025-10-15 Published: 2025-10-14
  • Corresponding author: PENG Jian (jianpeng@scu.edu.cn)
  • About author: (2022223045223@stu.scu.edu.cn)
    ZHAO Chen, born in 1999, postgraduate. His main research interest is skeleton-based human action recognition.
    PENG Jian, born in 1970, Ph.D., professor, Ph.D. supervisor. His main research interests include artificial intelligence, Internet of Things technology and big data.
  • Supported by:
    Sichuan Science and Technology Program (2023YFG0115, 2023YFG0112), Sichuan Industrial Development Fund Industry Foundation Task Project (2023JB06, 2023JB03) and Cooperative Program of Sichuan University and Zigong (2022CDZG-6).

Abstract: In recent years, skeleton-based action recognition has received extensive attention from researchers and has made great progress. As powerful and effective model paradigms, graph convolutional networks (GCNs) and convolutional neural networks (CNNs) are widely used in this field. However, 1) most GCN-based methods model spatial and temporal features separately, in alternating steps, which hinders direct interaction between spatial and temporal information; 2) CNN-based methods model spatial-temporal information effectively but, compared with GCN-based methods, do not make good use of spatial information. To address these problems, this paper proposes a novel spatial-temporal aggregation operation called Spatial-Temporal Joint Mapping (STJM). The proposed method combines the topological information of the graph used in GCN-based methods with the CNN-based approach of aggregating spatial and temporal information simultaneously. Compared with traditional GCN methods, STJM maps the joints into a higher-dimensional space and therefore has stronger representational capacity. After this high-dimensional mapping, a single τ×K convolution kernel is enough to aggregate temporal and spatial features at the same time. Because STJM is a spatial-temporal aggregation module, many GCN-based topology enhancement strategies can also be applied to the STJM block. Compared with previous models that aggregate spatial and temporal information simultaneously, the proposed method performs better. Experiments show that combining the STJM block, as a plug-and-play module, with existing GCN models yields significant gains and exceeds previous state-of-the-art models on two large-scale skeleton datasets: NTU RGB+D 60 and NTU RGB+D 120.

Key words: Graph convolutional network (GCN), Convolutional neural network (CNN), Action recognition, Spatial-temporal modeling, Skeleton sequence
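The abstract states that, once joints have been mapped to a higher dimension, one τ×K kernel can aggregate temporal and spatial features together, but it does not define the mapping itself. The following is a minimal single-channel sketch of that idea under stated assumptions, not the authors' implementation: the `neighbors` table, the helper names `map_joints` and `st_conv`, and the choice of "valid" temporal padding are all hypothetical.

```python
# Illustrative sketch only. The neighbor-table mapping below is an assumption;
# the abstract does not specify how STJM builds its high-dimensional mapping.

def map_joints(frames, neighbors):
    """Expand each joint into K slots taken from its (assumed) graph neighbors.

    frames:    T frames, each a list of V scalar joint features.
    neighbors: V lists of K joint indices (hypothetical topology table).
    Returns a T x V x K nested list.
    """
    return [[[frame[u] for u in neighbors[v]] for v in range(len(neighbors))]
            for frame in frames]


def st_conv(mapped, kernel):
    """Apply one tau x K kernel that spans tau frames and all K mapped slots.

    Each output value mixes temporal and spatial context in a single step.
    No temporal padding is used, so the output has T - tau + 1 frames.
    """
    tau, K = len(kernel), len(kernel[0])
    T, V = len(mapped), len(mapped[0])
    out = []
    for t in range(T - tau + 1):
        row = []
        for v in range(V):
            s = 0.0
            for dt in range(tau):
                for k in range(K):
                    s += kernel[dt][k] * mapped[t + dt][v][k]
            row.append(s)
        out.append(row)
    return out


# Toy skeleton: a 3-joint chain 0-1-2 with K = 3 slots (self, previous, next).
# End joints reuse themselves where a neighbor is missing (an assumption).
neighbors = [[0, 0, 1], [1, 0, 2], [2, 1, 2]]
frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # T = 3 frames, V = 3 joints
kernel = [[1, 1, 1], [1, 1, 1]]              # tau = 2, K = 3

mapped = map_joints(frames, neighbors)
out = st_conv(mapped, kernel)
print(out)  # [[17.0, 21.0, 25.0], [35.0, 39.0, 43.0]]
```

In this sketch the graph topology enters only through the neighbor table, so a learned or refined topology could in principle be swapped in without changing the convolution step.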

CLC Number:

  • TP183