Computer Science ›› 2026, Vol. 53 ›› Issue (4): 326-336. doi: 10.11896/jsjkx.251200015

• Computer Graphics & Multimedia •

Improved Facial Animation Generation Algorithm Based on EchoMimic and Its Application Specifications

ZHAN Qiwei1, REN Haojia2, XIAO Tiantian3   

  1 School of Criminal Justice, China University of Political Science and Law, Beijing 100088, China
    2 College of Resources and Environment, Xinjiang Agricultural University, Urumqi 830052, China
    3 Ministry of Emergency Management Big Data Center, Beijing 100013, China
  • Received:2025-12-01 Revised:2026-03-10 Online:2026-04-15 Published:2026-04-08
  • About author:ZHAN Qiwei,born in 1992,Ph.D,associate professor,master’s supervisor.His main research interests include artificial intelligence law and governance of science and technology ethics.
    XIAO Tiantian,born in 1989,master,engineer.Her main research interests include emergency management,disaster prevention and mitigation,equipment development,artificial intelligence and machine learning.
  • Supported by:
    Fundamental Research Funds for the Central Universities(24CXTD02).

Abstract: In recent years,diffusion model-based approaches for speech-driven facial animation generation have achieved breakthrough progress,efficiently producing high-resolution talking videos with long temporal sequences and precise audio-lip synchronization.However,the videos generated by current methods generally suffer from noticeable blurring and artifacts in the mouth region,which seriously impairs the realism and visual credibility of the synthesized videos.To address this issue,this paper proposes LiveEchoMimic,an improved facial animation generation algorithm based on EchoMimic,and further explores its standardized application paradigm.From the technical implementation perspective,LiveEchoMimic constructs an end-to-end framework for natural talking video generation,with the EchoMimic diffusion model and an implicit keypoint model serving as the dual-core architecture.Specifically,the EchoMimic diffusion model leverages a joint constraint mechanism of audio features and facial keypoints to generate coarse-grained talking videos,while the implicit keypoint model adopts a video-driven paradigm that refines them into high-quality facial animation videos by regulating the displacement features in the implicit keypoint space.Furthermore,an audio-lip mapping model is constructed to accurately capture the intrinsic correlation between audio features and mouth motion states,and a dedicated mapping network is designed to enhance the audio-lip synchronization accuracy of the generated videos.Finally,extensive experimental evaluations are conducted on two public datasets (CelebV-HQ and MEAD) and one private dataset (Avatar).Both quantitative and qualitative results demonstrate that the proposed LiveEchoMimic method significantly outperforms state-of-the-art approaches in core metrics such as visual quality and audio-lip synchronization,achieving superior video generation performance.From the perspective of application norms,considering that highly realistic speech-driven facial animation technology may give rise to identity forgery and behavioral distortion,this paper puts forward actionable recommendations covering challenges,application principles,and implementation measures,intended to promote the sound development of speech-driven facial animation technology so that it can better serve social needs under controllable and secure conditions.
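A brief illustration of the audio-lip mapping component described above.Since the full text is not reproduced on this page,the following PyTorch sketch is hypothetical:it shows one plausible way a mapping network could regress per-frame lip displacements in an implicit keypoint space from audio features.All names and dimensions (AudioToLipMapper,audio_dim=768,20 lip keypoints) are illustrative assumptions,not the authors’ implementation.

import torch
import torch.nn as nn

class AudioToLipMapper(nn.Module):
    """Hypothetical mapping network: per-frame audio features -> lip-region
    displacement offsets in an implicit keypoint space. A sketch of the
    audio-lip mapping component described in the abstract, not the
    authors' actual architecture."""

    def __init__(self, audio_dim: int = 768, num_lip_kp: int = 20, kp_dim: int = 3):
        super().__init__()
        # A small MLP regressor; a real system would likely add temporal
        # modeling (e.g. convolutions or attention across frames).
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_lip_kp * kp_dim),
        )
        self.num_lip_kp = num_lip_kp
        self.kp_dim = kp_dim

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, frames, audio_dim)
        b, t, _ = audio_feat.shape
        delta = self.net(audio_feat)           # (batch, frames, num_lip_kp * kp_dim)
        return delta.view(b, t, self.num_lip_kp, self.kp_dim)

if __name__ == "__main__":
    mapper = AudioToLipMapper()
    audio = torch.randn(2, 25, 768)            # 2 clips, 25 frames of audio features
    delta = mapper(audio)                      # (2, 25, 20, 3) lip displacements
    neutral = torch.randn(2, 1, 20, 3)         # neutral lip keypoints, broadcast over frames
    driven = neutral + delta                   # driven keypoints fed to the renderer
    print(driven.shape)                        # torch.Size([2, 25, 20, 3])

In such a setup,the predicted displacements would be added to neutral lip keypoints and handed to the video-driven refinement stage;the authors’ actual network may differ in depth,temporal modeling,and keypoint parameterization.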

Key words: Diffusion model, Implicit keypoint model, Audio-lip synchronization, Implicit space, Mapping network, Facial animation, Identity and behavior incongruence

CLC Number: TP391

References:
[1]YAN W B.An Audio-Driven Facial Animation Generation Mo-del with Controllable Emotional Intensity[J].Information Technology and Informatization,2025(8):161-165.
[2]LIU L,LI H,ZHANG M,et al.A survey of deep learning-based facial animation driving methods[J].Journal of Xidian University,2025,52(2):57-84.
[3]HU Q H.Research on Key Technologies of Facial Animation Generation for Digital Humans[D].Chengdu:University of Electronic Science and Technology of China,2024.
[4]JI X J.Research on 3D Facial Animation Generation Based on Deep Learning[D].Hefei:University of Science and Technology of China,2023.
[5]LIU X M,LIU L,JIA D,et al.A Survey of 3D Facial Animation Technology Driven by Speech[J].Computer Systems & Applications,2022,31(10):44-50.
[6]DING N.The Performativity of Human Translators in the Era of Artificial General Intelligence:A Case Study of Conference Interpreters[J].Technology Enhanced Foreign Language Education,2025(2):10-16,98.
[7]WANG H W.Research on AR Interaction Design of Remote Office Meeting Platform from the Perspective of Embodiment[D].Nanjing:Nanjing University of Science and Technology,2022.
[8]XU H X,LIU L,WANG J,et al.An Augmented Reality-Enabled Motion Monitoring and Interaction System for Mobile Robots[J/OL].China Mechanical Engineering,1-10[2024-11-24].https://link.cnki.net/urlid/42.1294.th.20251111.1619.007.
[9]HU J W,HE H Y,LEI Y J,et al.An Augmented Reality Human-Robot Interaction Teleoperation System for Dual-Arm Collaborative Robots[J].Transducer and Microsystem Technologies,2025,44(11):87-92.
[10]ZHENG F,LIU X Y.The Application of Virtual Reality Technology in Film and Television Production[J].China Information Times,2025(3):49-51.
[11]AN J.The Integration and Innovation of Cross-Media Art and Film Production in the Digital-Intelligence Era[N].Henan Economic Daily,2024-06-22(11).
[12]CHENG C,ZHAO Z K,DONG W J,et al.A Multimodal-Driven Facial Animation Generation Model with Controllable Emotion[J].Science Technology and Engineering,2025,25(28):12120-12129.
[14]DOU Z W,LI W S.Facial Animation Generation Based on Transformer[J].Software Engineering,2023,26(12):59-62.
[15]CAI G X.Audio2Face:Intelligently Generating Facial Animation for Virtual Characters from Audio Files[J].Modern Film Technology,2021(9):60-61.
[16]BLANZ V,VETTER T.Face recognition based on fitting a 3D morphable model[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2003,25(9):1063-1074.
[17]BOOTH J,ROUSSOS A,ZAFEIRIOU S,et al.A 3d morphable model learnt from 10,000 faces[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:5543-5552.
[18]GENOVA K,COLE F,MASCHINOT A,et al.Unsupervised training for 3d morphable model regression[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:8377-8386.
[19]ROMDHANI S,VETTER T.Efficient,robust and accurate fitting of a 3D morphable model[C]//Proceedings Ninth IEEE International Conference on Computer Vision.IEEE,2003:59-66.
[20]EGGER B,SMITH W A P,TEWARI A,et al.3d morphable face models-past,present,and future[J].ACM Transactions on Graphics,2020,39(5):1-38.
[21]REMONDINO F,KARAMI A,YAN Z,et al.A critical analysis of NeRF-based 3D reconstruction[J].Remote Sensing,2023,15(14):3585.
[22]WANG Z,WU S,XIE W,et al.NeRF--:Neural radiance fields without known camera parameters[J].arXiv:2102.07064,2021.
[23]YARIV L,GU J,KASTEN Y,et al.Volume rendering of neural implicit surfaces[J].Advances in Neural Information Processing Systems,2021,34:4805-4815.
[24]NIEMEYER M,MESCHEDER L,OECHSLE M,et al.Differentiable volumetric rendering:Learning implicit 3d representations without 3d supervision[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:3504-3515.
[25]TEWARI A,THIES J,MILDENHALL B,et al.Advances in neural rendering[J].Computer Graphics Forum,2022,41(2):703-735.
[26]CHEN S,SUN P,SONG Y,et al.Diffusiondet:Diffusion model for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:19830-19843.
[27]BERRY F S,BERRY W D.Innovation and diffusion models in policy research[J].Theories of the Policy Process,2018:253-297.
[28]HO J,SALIMANS T,GRITSENKO A,et al.Video diffusion models[J].Advances in Neural Information Processing Systems,2022,35:8633-8646.
[29]WU L,SUN P,FU Y,et al.A neural influence diffusion model for social recommendation[C]//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval.2019:235-244.
[31]ZHU H,WU W,ZHU W,et al.CelebV-HQ:A large-scale video facial attributes dataset[C]//European Conference on Computer Vision.Cham:Springer,2022:650-667.
[32]WANG K,WU Q,SONG L,et al.Mead:A large-scale audio-visual dataset for emotional talking-face generation[C]//European Conference on Computer Vision.Cham:Springer,2020:700-717.
[33]SHAO M H,LU H R,WANG G D.A Speech-Driven Digital Human Face Generation Method Based on 4D Gaussian Splatting[EB/OL].https://link.cnki.net/urlid/50.1075.TP.20251205.1334.026.
[34]HUANG C X,LU T L,PENG S F.Research on Active Defense Against Face Forgery Based on Hybrid Color Space and Attention Mechanism[EB/OL].https://link.cnki.net/urlid/50.1075.tp.20251219.1739.045.
[35]CUDEIRO D,BOLKART T,LAIDLAW C,et al.Capture,learning,and synthesis of 3D speaking styles[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:10101-10111.
[36]FAN Y,LIN Z,SAITO J,et al.Faceformer:Speech-driven 3d facial animation with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:18770-18780.
[37]ZHANG W,CUN X,WANG X,et al.Sadtalker:Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:8652-8661.
[38]WEI H,YANG Z,WANG Z.Aniportrait:Audio-driven synthesis of photorealistic portrait animation[J].arXiv:2403.17694,2024.
[39]TAN S,JI B,BI M,et al.Edtalk:Efficient disentanglement for emotional talking head synthesis[C]//European Conference on Computer Vision.Cham:Springer,2024:398-416.
[40]CHEN Z,CAO J,CHEN Z,et al.Echomimic:Lifelike audio-driven portrait animations through editable landmark conditions[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2025:2403-2410.
[41]CUI J,LI H,YAO Y,et al.Hallo2:Long-duration and high-resolution audio-driven portrait image animation[J].arXiv:2410.07718,2024.
[42]LI W,ZHANG L,WANG D,et al.One-shot high-fidelity talking-head synthesis with deformable neural radiance field[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:17969-17978.
[43]MA Z,ZHU X,QI G J,et al.Otavatar:One-shot talking faceavatar with controllable tri-plane rendering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:16901-16910.
[44]SUN J,WANG X,WANG L,et al.Next3d:Generative neural texture rasterization for 3d-aware head avatars[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:20991-21002.
[45]DENG Y,WANG D,REN X,et al.Portrait4d:Learning one-shot 4d head avatar synthesis using synthetic data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:7119-7130.
[46]GUO J,ZHANG D,LIU X,et al.LivePortrait:Efficient Portrait Animation with Stitching and Retargeting Control[J].arXiv:2407.03168,2024.
[47]SUN R,FA Y W,FENG H D,et al.Research progress on face presentation attack detection method based on deep learning[J].Computer Science,2025,52(2):323-335.
[48]JIANG J,ZHANG Q,WANG C Y.A review of iris recognition based on deep learning[J].Journal of Frontiers of Computer Science and Technology,2024,18(6):1421-1437.
[49]CHEN C F,XU Y F.Deepfakes in the Intelligent Era and Approaches to Their Governance[J].News and Writing,2020(4):66-71.
[50]XIE J C,TANG E S.The Social Harm and Governance Controversy of Deepfakes[J].News and Writing,2023(4):96-105.
[51]XIONG B.The Expansive Risks of Criminal Governance in Deepfake Technology and Its Limits[J].Journal of Anhui University (Philosophy and Social Science),2020(6):105-113.
[52]CHEN Z B,ZHANG L H.Legal Regulation of Algorithm Technology:Governance Dilemma,Development Logic,and Optimization Path[J].China Journal of Applied Jurisprudence,2024(4):155-166.
[53]ZHAN Q W.Review and Optimization of China’s Criminal Law Protection Model for Personality Rights[J].China Criminal Law Journal,2025(4):69-86.
[54]LI H S.Criminal Law Responses to Identity Fraud in the Digital Age[J].Jianghuai Forum,2024(3):124-132.
[55]LI H S.On the Criminal Responsibility for the Abuse of Personal Biometric Information:Taking Artificial Intelligence “Deepfake” as an Example[J].Tribune of Political Science and Law,2020,38(4):144-154.
[56]LIU Y H.The Iterative Upgrading of Cyber Crime to Digital Crime and the Response from Criminal Law[J].Journal of Comparative Law,2025(1):1-15.
[57]ZHAO B Z,ZHAN Q W.Reality Challenges and Future Prospects:A Reflection on Artificial Intelligence in Criminal Jurisprudence[J].Jinan Journal (Philosophy and Social Science),2019(1):98-110.
[58]SHEN W X.Reconstruction of the Digital Rights System:Toward a Pattern of Differential Order of Privacy,Information and Data[J].Tribune of Political Science and Law,2022,40(3):89-102.