Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 240100094-8. doi: 10.11896/jsjkx.240100094

• Image Processing & Multimedia Technology •

Text-driven Generation of Emotionally Diverse Facial Animations

LIU Zengke, YIN Jibin   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-11-16  Published: 2024-11-13
  • Corresponding author: YIN Jibin (41868028@qq.com)
  • About author: LIU Zengke, born in 1998, postgraduate (13137748978@163.com). His main research interests include deep learning and image recognition.
    YIN Jibin, born in 1976, Ph.D., associate professor. His main research interests include human-computer interaction and artificial intelligence.

Abstract: This paper presents an innovative text-driven facial animation synthesis technique, which integrates emotion models to enhance the expressiveness of facial expressions. The methodology is composed of two core components: facial emotion simulation and the consistency between lip movements and speech. Initially, a deep analysis of the input text identifies the types of emotions it contains and their intensities. Subsequently, these emotional cues are used to generate the corresponding facial expressions with the three-dimensional free-form deformation algorithm (DFFD). Concurrently, phoneme and lip-movement data from human speech are collected and precisely aligned in time with the phonemes of the text using forced alignment, yielding a sequence of lip key-point changes. Intermediate frames are then generated by linear interpolation to further refine the timeline of lip movements, and the DFFD algorithm synthesizes the lip animation from this time series. By carefully balancing the weights between facial emotions and lip animation, the approach achieves highly realistic virtual facial expressions. The study not only addresses the problem of missing emotional information in text-driven facial expression synthesis, but also overcomes the challenges of monotonous expressions and of mismatch between facial expressions and lip shapes, offering an innovative solution for human-computer interaction, game development, film and television production, and related fields.
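
As a rough illustration of the expression-synthesis step described above, the sketch below deforms face-mesh vertices by a weighted sum of control-point displacements, with an emotion offset scaled by the intensity detected from the text. It is a minimal stand-in, not the paper's implementation: DFFD in the literature denotes Dirichlet free-form deformation, whose weights are Sibson (natural-neighbour) coordinates, whereas this sketch substitutes normalized inverse-distance weights; the mesh, control points, and "happy" offsets are invented for illustration. The lip-sync half of the pipeline is sketched after the key-word list below.

```python
# Minimal sketch of DFFD-style expression synthesis (simplified, assumed):
# vertices follow a weighted sum of control-point displacements. True DFFD
# uses Sibson (natural-neighbour) coordinates as weights; plain normalized
# inverse-distance weights stand in for them here.
import numpy as np

def deform(vertices, controls, displacements, eps=1e-8):
    """Move (N,3) vertices by (M,3) control-point displacements."""
    # (N, M) distances from every vertex to every control point.
    d = np.linalg.norm(vertices[:, None, :] - controls[None, :, :], axis=2)
    w = 1.0 / (d + eps)                  # nearer control points dominate
    w /= w.sum(axis=1, keepdims=True)    # normalize so each row sums to 1
    return vertices + w @ displacements  # (N,3) deformed vertices

# Illustrative use: a "happy" target on two mouth-corner control points,
# scaled by the emotion intensity extracted from the text (0.7 is assumed).
intensity = 0.7
verts = np.random.rand(100, 3)                         # placeholder face mesh
ctrls = np.array([[0.3, 0.2, 0.0], [0.7, 0.2, 0.0]])   # mouth corners
offset = intensity * np.array([[-0.02, 0.03, 0.0], [0.02, 0.03, 0.0]])
deformed = deform(verts, ctrls, offset)
```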

Key words: Text-driven animation, Emotion model, DFFD, Facial animation synthesis, Emotion intensity, Lip-sync consistency
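
The lip-sync half of the pipeline can be sketched the same way. Assuming forced alignment has already produced phoneme intervals, the snippet below maps each phoneme to a viseme keyframe of lip key points, fills intermediate frames by linear interpolation, and blends the result with an emotion track using fixed weights. The alignment intervals, the four-point visemes, and the 0.4/0.6 weights are illustrative assumptions, not values from the paper.

```python
# Sketch of the lip-sync timeline (assumed values throughout).
import numpy as np

FPS = 25  # output frame rate (assumed)

# Forced-alignment output: (phoneme, start_sec, end_sec).
alignment = [("HH", 0.00, 0.08), ("EH", 0.08, 0.22),
             ("L", 0.22, 0.30), ("OW", 0.30, 0.55)]

# Each viseme is a tiny set of 2D lip key points (4 points keep it short).
visemes = {
    "HH": np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.1], [0.5, -0.1]]),
    "EH": np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.3], [0.5, -0.3]]),
    "L":  np.array([[0.1, 0.0], [0.9, 0.0], [0.5, 0.2], [0.5, -0.2]]),
    "OW": np.array([[0.2, 0.0], [0.8, 0.0], [0.5, 0.4], [0.5, -0.4]]),
}

# Keyframe each viseme at the midpoint of its phoneme interval.
times = np.array([(s + e) / 2 for _, s, e in alignment])
keys = np.stack([visemes[p] for p, _, _ in alignment])      # (K, 4, 2)

# Linearly interpolate every lip key point at every output frame.
frame_t = np.arange(0.0, alignment[-1][2], 1.0 / FPS)
lip = np.stack([np.stack([np.interp(frame_t, times, keys[:, i, j])
                          for j in range(2)], axis=1)
                for i in range(keys.shape[1])], axis=1)      # (F, 4, 2)

# Weighted mix with an emotion track, mirroring the weight balancing the
# abstract describes; the weights and the zero emotion track are placeholders.
w_emotion, w_lip = 0.4, 0.6
emotion = np.zeros_like(lip)                 # placeholder emotion displacements
blended = w_emotion * emotion + w_lip * lip  # fed to DFFD frame by frame
```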

CLC Number: TP315.69