Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 221100007-5. doi: 10.11896/jsjkx.221100007

• Image Processing & Multimedia Technology •

Coupling Local Features and Global Representations for 2D Human Pose Estimation

CHEN Qiaosong1, WU Jiliang1, JIANG Bo1, TAN Chongchong2, SUN Kaiwei1, DENG Xin1, WANG Jin1

  1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. School of Automation / School of Industrial Internet, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Published: 2023-11-09
  • Corresponding author: CHEN Qiaosong (chenqs@cqupt.edu.cn)
  • About author: CHEN Qiaosong, born in 1978, Ph.D., associate professor. His main research interests include image processing, image understanding, artificial intelligence and computer vision.
  • Supported by: National Key Research and Development Program of China (2022YFE0101000).


Abstract: In recent years, both convolutional neural networks and Transformers have made progress in the field of human pose estimation. Convolutional neural networks (CNNs) are good at extracting local features, while Transformers do well at capturing global representations. However, studies combining the two for human pose estimation are few, and their results are unsatisfactory. To address this problem, this paper proposes CNPose (CNN-Nest Pose), a model that couples local features and global representations. The local-global feature coupling module of this framework uses multi-head attention and a residual structure to deeply couple local features with global representations. In addition, a local-global information exchange module is proposed to resolve the inconsistency between the data-source ranges of the local features and the global representations in the coupling module's computation. CNPose is validated on the COCO-val2017 and COCO-test-dev2017 datasets, and experimental results show that the CNPose model, which couples local features and global representations, outperforms similar methods.
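The abstract describes the coupling mechanism only at a high level: local CNN features and global Transformer representations are fused through multi-head attention with a residual connection. The following is a minimal NumPy sketch of that general idea, not the paper's actual CNPose implementation; the function name, the use of local features as queries against global representations as keys/values, and the random stand-in projection weights are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def couple_local_global(local_feats, global_reps, num_heads=4, seed=0):
    """Sketch of coupling CNN-style local features (used here as queries)
    with Transformer-style global representations (keys/values) via
    multi-head attention, followed by a residual connection.

    local_feats : (n_loc, d) array, global_reps : (n_glob, d) array.
    """
    n_loc, d = local_feats.shape
    n_glob, d_g = global_reps.shape
    assert d == d_g and d % num_heads == 0
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned Q/K/V/output weights.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q = (local_feats @ Wq).reshape(n_loc, num_heads, dh)
    k = (global_reps @ Wk).reshape(n_glob, num_heads, dh)
    v = (global_reps @ Wv).reshape(n_glob, num_heads, dh)
    out = np.empty_like(q)
    for h in range(num_heads):
        # Scaled dot-product attention per head: local queries attend
        # over the global representations.
        attn = softmax(q[:, h] @ k[:, h].T / np.sqrt(dh))
        out[:, h] = attn @ v[:, h]
    # Residual structure: the attended global context is added back
    # onto the original local features.
    return local_feats + out.reshape(n_loc, d) @ Wo
```

The residual sum keeps the output in the same shape and space as the local features, so such a block can be stacked repeatedly, which is presumably what makes the "deep coupling" of the two feature types possible.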

Key words: Human pose estimation, Transformer, Convolutional neural networks, Local features, Global representations, Feature coupling, Attention

CLC number: TP391.4