Computer Science ›› 2025, Vol. 52 ›› Issue (3): 68-76. doi: 10.11896/jsjkx.240600063

• 3D Vision and Metaverse •

Multi-view Multi-person 3D Human Pose Estimation Based on Center-point Attention

JIANG Yiheng1, LI Yang1,2, LIU Chunyan1, ZHAO Yunlong1

  1. 1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
    2 Unmanned Aerial Vehicles Research Institute, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Received: 2024-06-06 Revised: 2024-09-26 Online: 2025-03-15 Published: 2025-03-07
  • Corresponding author: LI Yang (liyangnuaa@nuaa.edu.cn)
  • About author: (915200547@qq.com) JIANG Yiheng, born in 1999, postgraduate. His main research interests include artificial intelligence and computer vision.
    LI Yang, born in 1986, Ph.D., is a member of CCF (No. J4845M). His main research interests include artificial intelligence, collective computing and privacy protection.
  • Supported by:
    National Science and Technology Major Project of New Generation Artificial Intelligence (2022ZD0115403).

Abstract: Multi-view multi-person 3D human pose estimation is widely used in computer vision tasks. Current spatial voxel-based methods consume so many resources that real-time operation on edge computing devices is difficult, while regression-based methods lack geometric constraints and therefore generalize poorly: they cannot be applied directly in a new environment and require newly collected data for fine-tuning. Combining the strengths of both approaches, we propose a multi-view multi-person 3D human pose estimation model based on center-point attention regression. The model uses a small-scale voxel network to roughly estimate the position of each human body center and constructs an initial pose from it, then performs regression prediction within the range of the body center point to obtain a more accurate pose. By incorporating spatial keypoint positions, the model's regression prediction becomes more accurate, with average accuracy improving by 1.16% at large scales. The model is also very easy to train, with accuracy improving by up to 12% in small-sample fine-tuning. As a result, regression-based models can be deployed quickly in new scenarios after training on a small amount of data, greatly improving their generalization and versatility.
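The coarse-to-fine idea in the abstract (a rough body center, an initial pose built from it, then refinement restricted to the neighborhood of that center) can be illustrated with a minimal sketch. This is not the authors' implementation: the template offsets, the `radius` parameter, the clamping rule, and all function names are hypothetical, and the `offsets` argument merely stands in for the output of a learned regression network.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

# Hypothetical template: joint offsets (in metres) relative to the body center.
TEMPLATE_OFFSETS: List[Point] = [
    (0.0, 0.0, 0.5),    # head (illustrative value)
    (0.0, 0.0, 0.0),    # body center
    (0.0, 0.0, -0.75),  # feet, merged into one joint for brevity
]

def initial_pose(center: Point,
                 template: Sequence[Point] = TEMPLATE_OFFSETS) -> List[Point]:
    """Place a template skeleton at the coarsely estimated body center."""
    cx, cy, cz = center
    return [(cx + dx, cy + dy, cz + dz) for dx, dy, dz in template]

def clamp_to_center(joint: Point, center: Point, radius: float) -> Point:
    """Pull a joint back onto the ball of `radius` around `center` if it left it."""
    delta = [j - c for j, c in zip(joint, center)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return joint
    scale = radius / norm
    return tuple(c + d * scale for c, d in zip(center, delta))

def refine_pose(pose: Sequence[Point], offsets: Sequence[Point],
                center: Point, radius: float = 1.0) -> List[Point]:
    """Apply per-joint offsets (assumed to come from a regressor), then
    keep every joint within `radius` of the body center."""
    refined = []
    for joint, offset in zip(pose, offsets):
        moved = tuple(j + o for j, o in zip(joint, offset))
        refined.append(clamp_to_center(moved, center, radius))
    return refined
```

In this sketch a large predicted offset cannot move a joint arbitrarily far from the body center, which is one simple way to express a geometric constraint on a regression output; the model described in the abstract uses attention around the center point instead.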

Key words: 3D human pose estimation, Multi-view, Center-point proposal network, Center-point attention, Transformer, VoxelNet

CLC Number: TP311