Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 210900094-5. doi: 10.11896/jsjkx.210900094

• Image Processing & Multimedia Technology •

Speech-driven Personal Style Gesture Generation Method Based on Spatio-Temporal Graph Convolutional Networks

ZHANG Bin, LIU Chang-hong, ZENG Sheng, JIE An-quan

  1. Jiangxi Normal University, Nanchang 330022, China
  • Online: 2022-11-10 Published: 2022-11-21
  • Corresponding author: LIU Chang-hong (liuch@jxnu.edu.cn)
  • About author: (zhangbin@jxnu.edu.cn)
  • Supported by:
    National Natural Science Foundation of China (62067004, 61662030)

Speech-driven Personal Style Gesture Generation Method Based on Spatio-Temporal Graph Convolutional Networks

ZHANG Bin, LIU Chang-hong, ZENG Sheng, JIE An-quan   

  1. School of Computer & Information Engineering,Jiangxi Normal University,Nanchang 330022,China
  • Online:2022-11-10 Published:2022-11-21
  • About author:ZHANG Bin,born in 1997,postgraduate.His main research interests include cross-modal generation and computer vision.
    LIU Chang-hong,born in 1977,Ph.D,associate professor,is a member of China Computer Federation.Her main research interests include computer vision,cross-modal retrieval and hyperspectral image processing.
  • Supported by:
    National Natural Science Foundation of China(62067004,61662030).

Abstract: People often exhibit unique personal styles in the gestures they make while speaking. Researchers have proposed speech-driven personal style gesture generation methods based on generative adversarial networks, but the generated motions are unnatural and temporally discontinuous. To address this problem, this paper proposes a speech-driven personal style gesture generation method based on spatio-temporal graph convolutional networks, introducing a temporal-dynamics discriminator built on a spatio-temporal graph convolutional network. The discriminator models the spatial and temporal structural relationships among gesture joints, captures the spatial correlations among the joints, and extracts their temporal dynamics features, so that the generated gestures remain temporally coherent and better match the behavior and structure of real gestures. Experiments on the speech-gesture dataset constructed by Ginosar et al. show that, compared with related methods, the percentage of correct keypoints improves by 2%~5% and the generated gestures are more natural.

Key words: Cross-modal generation, Gesture generation, Personal style learning, Spatio-temporal graph convolutional networks, Temporal dynamics

Abstract: People’s gestures while speaking often carry their own unique personal style. Researchers have proposed speech-driven personal style gesture generation methods based on generative adversarial networks; however, the generated motions are unnatural and temporally discontinuous. To solve this problem, this paper proposes a speech-driven personal style gesture generation method based on spatio-temporal graph convolutional networks (STGCN), which adds a temporal-dynamics discriminator built on an STGCN. The spatial and temporal structural relationships between gesture joints are first constructed; the STGCN then captures the spatial correlations among the joints and extracts their temporal dynamics features, so that the generated gestures maintain temporal coherence and better match the behavior and structure of real gestures. The proposed method is verified on the speech and gesture dataset constructed by Ginosar et al. Compared with related methods, the percentage of correct keypoints improves by 2%~5%, and the generated gestures are more natural.
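The core operation the abstract describes — a graph convolution over the skeleton's joints within each frame, followed by a convolution over time per joint — can be sketched in miniature. The 3-joint chain skeleton, scalar per-joint features, and fixed smoothing kernel below are illustrative assumptions for exposition, not the paper's actual discriminator configuration:

```python
# Minimal sketch of one spatio-temporal graph-convolution step over gesture
# keypoints. Toy setup: 3 joints in a chain (0-1, 1-2), self-loops added.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

# Symmetric normalisation of the adjacency: A_hat = D^{-1/2} A D^{-1/2}.
deg = [sum(row) for row in A]
A_hat = [[A[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(3)]
         for i in range(3)]

def spatial_gcn(frame):
    """Spatial step: each joint aggregates features from its graph
    neighbours within one frame. `frame` holds one scalar per joint."""
    return [sum(A_hat[i][j] * frame[j] for j in range(3)) for i in range(3)]

def temporal_conv(seq, kernel=(0.25, 0.5, 0.25)):
    """Temporal step: 1-D convolution over frames for a single joint,
    extracting smoothed dynamics (zero-padded at the clip boundaries)."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        v = 0.0
        for k, w in enumerate(kernel):
            tt = t + k - half
            if 0 <= tt < len(seq):
                v += w * seq[tt]
        out.append(v)
    return out

# A toy gesture clip: 4 frames x 3 joints (one scalar feature per joint).
clip = [[0.0, 1.0, 0.0],
        [0.5, 1.0, 0.5],
        [1.0, 1.0, 1.0],
        [0.5, 1.0, 0.5]]

# The ST-GCN pattern: spatial aggregation per frame, then temporal
# convolution per joint, yielding one dynamics trajectory per joint.
spatial = [spatial_gcn(frame) for frame in clip]
features = [temporal_conv([frame[j] for frame in spatial]) for j in range(3)]
```

A discriminator would stack several such layers (with learned weights rather than the fixed kernel above) and pool the result into a real/fake score, penalising generated sequences whose joint dynamics are incoherent over time.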

Key words: Cross-modal generation, Gesture generation, Personal style learning, Spatio-temporal graph convolutional networks, Temporal dynamics

CLC number: TP391.1
[1]YAGHOUBZADEH R,KRAMER M,PITSCH K,et al.Virtual agents as daily assistants for elderly or cognitively impaired people[C]//International Workshop on Intelligent Virtual Agents.Berlin:Springer,2013:79-91.
[2]LI J,KIZILCEC R,BAILENSON J,et al.Social robots and virtual agents as lecturers for video instruction[J].Computers in Human Behavior,2016,55:1222-1230.
[3]PACELLA D,LÓPEZ-PÉREZ B.Assessing children’s interpersonal emotion regulation with virtual agents:The serious game Emodiscovery[J].Computers & Education,2018,123:1-12.
[4]TAN S M,LIEW T W.Designing embodied virtual agents as product specialists in a multi-product category E-commerce:The roles of source credibility and social presence[J].International Journal of Human-Computer Interaction,2020,36(12):1136-1149.
[5]YOON Y,KO W R,JANG M,et al.Robots learn social skills:End-to-end learning of co-speech gesture generation for humanoid robots[C]//2019 International Conference on Robotics and Automation(ICRA).IEEE,2019:4303-4309.
[6]VAN VUUREN S,CHERNEY L R.A virtual therapist for speech and language therapy[C]//International Conference on Intelligent Virtual Agents.Cham:Springer,2014:438-448.
[7]KANG S H,FENG A W,SEYMOUR M,et al.Smart Mobile Virtual Characters:Video Characters vs.Animated Characters[C]//Proceedings of the Fourth International Conference on Human Agent Interaction.2016:371-374.
[8]HOLLER J,LEVINSON S C.Multimodal language processing in human communication[J].Trends in Cognitive Sciences,2019,23(8):639-652.
[9]BAVELAS J,GERWING J,SUTTON C,et al.Gesturing on the telephone:Independent effects of dialogue and visibility[J].Journal of Memory and Language,2008,58(2):495-520.
[10]POUW W,HARRISON S J,DIXON J A.Gesture-speech phy-sics:The biomechanical basis for the emergence of gesture-speech synchrony[J].Journal of Experimental Psychology:General,2020,149(2):391.
[11]BUTTERWORTH B,HADAR U.Gesture,speech,and computational stages:A reply to McNeill[J].Psychological Review,1989,96(1):168-174.
[12]GINOSAR S,BAR A,KOHAVI G,et al.Learning individual styles of conversational gesture[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:3497-3506.
[13]WANG X,MENG H H,JIANG X T,et al.Survey on Character Motion Synthesis Based on Neural Network[J].Computer Science,2019,46(9):22-27.
[14]XIN Q Q,CHEN Z X,FENG X X,et al.Movement Drive and Control Constraints of Virtual Hand Based on Multi-curve Spectrum[J].Computer Science,2014,41(1):126-129,151.
[15]MARSELLA S,XU Y,LHOMMET M,et al.Virtual character performance from speech[C]//Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation.2013:25-35.
[16]THIEBAUX M,MARSELLA S,MARSHALL A N,et al.Smartbody:Behavior realization for embodied conversational agents[C]//Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 1.2008:151-158.
[17]NEFF M,KIPP M,ALBRECHT I,et al.Gesture modeling and animation based on a probabilistic recreation of speaker style[J].ACM Transactions on Graphics(TOG),2008,27(1):1-24.
[18]SADOUGHI N,BUSSO C.Speech-driven animation with meaningful behaviors[J].Speech Communication,2019,110:90-100.
[19]ALEXANDERSON S,HENTER G E,KUCHERENKO T,et al.Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows[C]//Computer Graphics Forum.2020,39(2):487-496.
[20]GUO D,TANG S G,HONG R C,et al.Review of Sign Language Recognition,Translation and Generation[J].Computer Science,2021,48(3):60-70.
[21]KUCHERENKO T,NAGY R,JONELL P,et al.Speech2Properties2Gestures:Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech[J].arXiv:2106.14736,2021.
[22]HASEGAWA D,KANEKO N,SHIRAKAWA S,et al.Evaluation of speech-to-gesture generation using bi-directional LSTM network[C]//Proceedings of the 18th International Conference on Intelligent Virtual Agents.2018:79-86.
[23]KUCHERENKO T,HASEGAWA D,HENTER G E,et al.Analyzing input and output representations for speech-driven gesture generation[C]//Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents.2019:97-104.
[24]YUNUS F,CLAVEL C,PELACHAUD C.Sequence-to-Sequence Predictive Model:From Prosody To Communicative Gestures[C]//International Conference on Human-Computer Interaction.Cham:Springer,2021:355-374.
[25]REBOL M,GÜTL C,PIETROSZEK K.Passing a Non-verbal Turing Test:Evaluating Gesture Animations Generated from Speech[C]//2021 IEEE Virtual Reality and 3D User Interfaces(VR).IEEE,2021:573-581.
[26]HABIBIE I,XU W,MEHTA D,et al.Learning Speech-driven 3D Conversational Gestures from Video[J].arXiv:2102.06837,2021.
[27]YAN S,XIONG Y,LIN D.Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Thirty-second AAAI Conference on Artificial Intelligence.2018.
[28]REN X,LI H,HUANG Z,et al.Music-oriented dance video synthesis with pose perceptual loss[J].arXiv:1912.06606,2019.
[29]CAO Z,HIDALGO G,SIMON T,et al.OpenPose:realtime multi-person 2D pose estimation using Part Affinity Fields[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2019,43(1):172-186.
[30]ABADI M.TensorFlow:learning functions at scale[C]//Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming.2016.
[31]KINGMA D P,BA J.Adam:A method for stochastic optimization[J].arXiv:1412.6980,2014.
[32]YANG Y,RAMANAN D.Articulated human detection with flexible mixtures of parts[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2012,35(12):2878-2890.