Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 210900094-5.doi: 10.11896/jsjkx.210900094

• Image Processing & Multimedia Technology •

Speech-driven Personal Style Gesture Generation Method Based on Spatio-temporal Graph Convolutional Networks

ZHANG Bin, LIU Chang-hong, ZENG Sheng, JIE An-quan   

  1. School of Computer & Information Engineering,Jiangxi Normal University,Nanchang 330022,China
  • Online:2022-11-10 Published:2022-11-21
  • About author:ZHANG Bin,born in 1997,postgraduate.His main research interests include cross-modal generation and computer vision.
    LIU Chang-hong,born in 1977,Ph.D,associate professor,is a member of China Computer Federation.Her main research interests include computer vision,cross-modal retrieval and hyper-spectral image processing.
  • Supported by:
    National Natural Science Foundation of China(62067004,61662030).

Abstract: People’s gestures while speaking often carry a unique personal style.Researchers have proposed speech-driven personal style gesture generation methods based on generative adversarial networks;however,the generated motions are unnatural due to temporal discontinuity.To solve this problem,this paper proposes a speech-driven personal style gesture generation method based on spatio-temporal graph convolutional networks,which adds a temporal dynamic discriminator built on a spatio-temporal graph convolutional network(STGCN).The spatial and temporal structural relationships between gesture joint points are first constructed;the STGCN then captures the spatial correlation of the joint points and extracts their dynamic characteristics over time,so that the generated gestures maintain temporal consistency and better match the behavior and structure of real gestures.The proposed method is verified on the speech and gesture dataset constructed by Ginosar et al.Compared with related methods,the percentage of correct keypoints improves by about 2%~5%,and the generated gestures are more natural.
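The discriminator's building block described above, a spatial graph convolution over the skeleton followed by a temporal aggregation across frames (after ST-GCN, reference [27]), can be sketched as follows. This is a minimal NumPy illustration: the 5-joint chain skeleton, channel sizes, and 3-frame window are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Toy skeleton: 5 joints in a chain 0-1-2-3-4 (e.g. arm keypoints).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
V = 5
A = np.eye(V)                                # self-loops
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A / A.sum(axis=1, keepdims=True)     # row-normalised adjacency

T, C_in, C_out = 8, 2, 4                     # frames, input/output channels
rng = np.random.default_rng(0)
X = rng.standard_normal((T, V, C_in))        # a gesture sequence of 2D joints
W = rng.standard_normal((C_in, C_out))       # learnable weights (random here)

# Spatial graph convolution: each joint aggregates its neighbours per frame.
S = np.einsum('uv,tvc->tuc', A_hat, X) @ W   # shape (T, V, C_out)

# Temporal aggregation: a 3-frame moving average per joint, giving the
# discriminator a view of motion continuity across the sequence.
Y = np.stack([S[max(0, t - 1):t + 2].mean(axis=0) for t in range(T)])
print(Y.shape)  # (8, 5, 4)
```

In the full model this step would be stacked with nonlinearities and learned temporal filters; the sketch only shows how spatial and temporal structure enter one layer.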

Key words: Cross-modal generation, Gesture generation, Personal style learning, Spatio-temporal graph convolutional networks, Temporal dynamics
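The percentage-of-correct-keypoints (PCK) metric used for the abstract's 2%~5% comparison can be sketched as below. This is a common formulation; the tolerance `alpha` and the bounding-box scale reference are assumptions, as the paper's exact threshold is not stated here.

```python
import numpy as np

def pck(pred, gt, alpha=0.2):
    """Percentage of Correct Keypoints: a keypoint counts as correct when
    its predicted position lies within alpha * (largest side of the
    ground-truth bounding box) of the true position.
    pred, gt: arrays of shape (num_keypoints, 2)."""
    scale = max(np.ptp(gt[:, 0]), np.ptp(gt[:, 1]))
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * scale))

# Toy example: 3 keypoints, threshold 0.2 * 10 = 2.0 units;
# the third prediction is 5.0 units off, so 2 of 3 are correct.
gt = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])
pred = np.array([[0.5, 0.0], [10.0, 0.5], [15.0, 10.0]])
print(pck(pred, gt))
```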

CLC Number: 

  • TP391.1
[1]YAGHOUBZADEH R,KRAMER M,PITSCH K,et al.Virtual agents as daily assistants for elderly or cognitively impaired people[C]//International Workshop on Intelligent Virtual Agents.Berlin:Springer,2013:79-91.
[2]LI J,KIZILCEC R,BAILENSON J,et al.Social robots and virtual agents as lecturers for video instruction[J].Computers in Human Behavior,2016,55:1222-1230.
[3]PACELLA D,LÓPEZ-PÉREZ B.Assessing children’s interpersonal emotion regulation with virtual agents:The serious game Emodiscovery[J].Computers & Education,2018,123:1-12.
[4]TAN S M,LIEW T W.Designing embodied virtual agents as product specialists in a multi-product category E-commerce:The roles of source credibility and social presence[J].International Journal of Human-Computer Interaction,2020,36(12):1136-1149.
[5]YOON Y,KO W R,JANG M,et al.Robots learn social skills:End-to-end learning of co-speech gesture generation for humanoid robots[C]//2019 International Conference on Robotics and Automation(ICRA).IEEE,2019:4303-4309.
[6]VAN VUUREN S,CHERNEY L R.A virtual therapist for speech and language therapy[C]//International Conference on Intelligent Virtual Agents.Cham:Springer,2014:438-448.
[7]KANG S H,FENG A W,SEYMOUR M,et al.Smart Mobile Virtual Characters:Video Characters vs.Animated Characters[C]//Proceedings of the Fourth International Conference on Human Agent Interaction.2016:371-374.
[8]HOLLER J,LEVINSON S C.Multimodal language processing in human communication[J].Trends in Cognitive Sciences,2019,23(8):639-652.
[9]BAVELAS J,GERWING J,SUTTON C,et al.Gesturing on the telephone:Independent effects of dialogue and visibility[J].Journal of Memory and Language,2008,58(2):495-520.
[10]POUW W,HARRISON S J,DIXON J A.Gesture-speech physics:The biomechanical basis for the emergence of gesture-speech synchrony[J].Journal of Experimental Psychology:General,2020,149(2):391.
[11]BUTTERWORTH B,HADAR U.Gesture,speech,and computational stages:A reply to McNeill[J].Psychological Review,1989,96(1):168-174.
[12]GINOSAR S,BAR A,KOHAVI G,et al.Learning individual styles of conversational gesture[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:3497-3506.
[13]WANG X,MENG H H,JIANG X T,et al.Survey on Character Motion Synthesis Based on Neural Network[J].Computer Science,2019,46(9):22-27.
[14]XIN Q Q,CHEN Z X,FENG X X,et al.Movement Drive and Control Constraints of Virtual Hand Based on Multi-curve Spectrum[J].Computer Science,2014,41(1):126-129,151.
[15]MARSELLA S,XU Y,LHOMMET M,et al.Virtual character performance from speech[C]//Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation.2013:25-35.
[16]THIEBAUX M,MARSELLA S,MARSHALL A N,et al.Smartbody:Behavior realization for embodied conversational agents[C]//Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 1.2008:151-158.
[17]NEFF M,KIPP M,ALBRECHT I,et al.Gesture modeling and animation based on a probabilistic recreation of speaker style[J].ACM Transactions on Graphics(TOG),2008,27(1):1-24.
[18]SADOUGHI N,BUSSO C.Speech-driven animation with meaningful behaviors[J].Speech Communication,2019,110:90-100.
[19]ALEXANDERSON S,HENTER G E,KUCHERENKO T,et al.Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows[C]//Computer Graphics Forum.2020,39(2):487-496.
[20]GUO D,TANG S G,HONG R C,et al.Review of Sign Language Recognition,Translation and Generation[J].Computer Science,2021,48(3):60-70.
[21]KUCHERENKO T,NAGY R,JONELL P,et al.Speech2Properties2Gestures:Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech[J].arXiv:2106.14736,2021.
[22]HASEGAWA D,KANEKO N,SHIRAKAWA S,et al.Evaluation of speech-to-gesture generation using bi-directional LSTM network[C]//Proceedings of the 18th International Conference on Intelligent Virtual Agents.2018:79-86.
[23]KUCHERENKO T,HASEGAWA D,HENTER G E,et al.Analyzing input and output representations for speech-driven gesture generation[C]//Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents.2019:97-104.
[24]YUNUS F,CLAVEL C,PELACHAUD C.Sequence-to-Sequence Predictive Model:From Prosody To Communicative Gestures[C]//International Conference on Human-Computer Interaction.Cham:Springer,2021:355-374.
[25]REBOL M,GÜTL C,PIETROSZEK K.Passing a Non-verbal Turing Test:Evaluating Gesture Animations Generated from Speech[C]//2021 IEEE Virtual Reality and 3D User Interfaces(VR).IEEE,2021:573-581.
[26]HABIBIE I,XU W,MEHTA D,et al.Learning Speech-driven 3D Conversational Gestures from Video[J].arXiv:2102.06837,2021.
[27]YAN S,XIONG Y,LIN D.Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Thirty-second AAAI Conference on Artificial Intelligence.2018.
[28]REN X,LI H,HUANG Z,et al.Music-oriented dance video synthesis with pose perceptual loss[J].arXiv:1912.06606,2019.
[29]CAO Z,HIDALGO G,SIMON T,et al.OpenPose:realtime multi-person 2D pose estimation using Part Affinity Fields[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2019,43(1):172-186.
[30]ABADI M.TensorFlow:learning functions at scale[C]//Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming.2016.
[31]KINGMA D P,BA J.Adam:A method for stochastic optimization[J].arXiv:1412.6980,2014.
[32]YANG Y,RAMANAN D.Articulated human detection with flexible mixtures of parts[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2012,35(12):2878-2890.