Computer Science, 2026, Vol. 53, Issue (4): 326-336. doi: 10.11896/jsjkx.251200015

• Computer Graphics & Multimedia •

Improved Facial Animation Generation Algorithm Based on EchoMimic and Its Application Specifications

ZHAN Qiwei1, REN Haojia2, XIAO Tiantian3

  1 School of Criminal Justice, China University of Political Science and Law, Beijing 100088, China
    2 College of Resources and Environment, Xinjiang Agricultural University, Urumqi 830052, China
    3 Big Data Center, Ministry of Emergency Management, Beijing 100013, China
  • Received: 2025-12-01  Revised: 2026-03-10  Published: 2026-04-15  Online: 2026-04-08
  • Corresponding author: XIAO Tiantian (smilesweetsweet@163.com)
  • About author: ZHAN Qiwei (zqwjxh@163.com), born in 1992, Ph.D, associate professor, master's supervisor. His main research interests include artificial intelligence law and governance of science and technology ethics.
    XIAO Tiantian, born in 1989, master, engineer. Her main research interests include emergency management, disaster prevention and mitigation, equipment development, artificial intelligence and machine learning.
  • Supported by:
    Special Fund for Basic Research Operations of Central Universities (24CXTD02).


Abstract: In recent years, diffusion model-based approaches for speech-driven facial animation generation have achieved breakthrough progress; such methods can efficiently produce high-resolution talking videos with long temporal sequences and precise audio-lip synchronization. However, the videos generated by current methods generally suffer from noticeable blurring and artifacts in the mouth region, which seriously impairs the realism and visual credibility of the synthesized videos. To address this issue, this paper proposes LiveEchoMimic, an improved facial animation generation algorithm based on EchoMimic, and further explores its standardized application norms. From the technical implementation perspective, it constructs an end-to-end framework for natural talking video generation, with the EchoMimic diffusion model and an implicit keypoint model serving as the dual-core architecture. Specifically, the EchoMimic diffusion model leverages a joint control mechanism of audio features and facial keypoints to generate coarse-grained talking videos, while the implicit keypoint model adopts a video-driven paradigm, realizing the refined generation of high-quality facial animation videos by regulating displacement features in the implicit keypoint space. Furthermore, an audio-lip mapping model is constructed to accurately capture the intrinsic correlation between audio features and mouth motion states, and a dedicated mapping network is designed to enhance the audio-lip synchronization accuracy of the generated videos. Finally, extensive experimental evaluations are conducted on two public datasets (CelebV-HQ and MEAD) and one private dataset (Avatar). Both quantitative and qualitative results demonstrate that the proposed LiveEchoMimic method significantly outperforms state-of-the-art approaches on core metrics such as visual quality and audio-lip synchronization, achieving superior video generation performance. From the perspective of application norms, considering that highly realistic speech-driven facial animation technology may give rise to identity forgery and behavioral distortion, this paper puts forward operable recommendations covering the challenges faced, application principles, and implementation measures. These recommendations are intended to promote the sound development of speech-driven facial animation technology so that it better meets the demands of social development under controllable and secure premises.
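The two-stage pipeline the abstract describes (coarse diffusion generation conditioned on audio and landmarks, followed by implicit-keypoint refinement guided by an audio-lip mapping network) can be illustrated with a minimal sketch. All names here (`coarse_generate`, `audio_to_lip`, `refine`, `live_echo_mimic`) are hypothetical placeholders standing in for the paper's modules, not the authors' actual API; the arithmetic is a toy stand-in for the real networks.

```python
# Illustrative sketch of the two-stage pipeline described in the abstract.
# Class and function names are placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    pixels: List[float]      # stand-in for an image tensor
    keypoints: List[float]   # implicit keypoint coordinates


def coarse_generate(audio_feats: List[float],
                    landmarks: List[float]) -> List[Frame]:
    """Stage 1: EchoMimic-style diffusion jointly conditioned on audio
    features and facial landmarks yields a coarse talking video."""
    return [Frame(pixels=[a * 0.5 for a in audio_feats],
                  keypoints=list(landmarks))
            for _ in range(len(audio_feats))]


def audio_to_lip(audio_feats: List[float]) -> List[float]:
    """Audio-lip mapping network: predicts per-frame mouth keypoint
    displacements from audio features (placeholder linear map)."""
    return [0.1 * a for a in audio_feats]


def refine(coarse: List[Frame], lip_disp: List[float]) -> List[Frame]:
    """Stage 2: the video-driven implicit-keypoint model shifts keypoints
    in the implicit space to sharpen the mouth region."""
    refined = []
    for frame, d in zip(coarse, lip_disp):
        refined.append(Frame(pixels=frame.pixels,
                             keypoints=[k + d for k in frame.keypoints]))
    return refined


def live_echo_mimic(audio_feats: List[float],
                    landmarks: List[float]) -> List[Frame]:
    """End-to-end flow: coarse generation, then lip-aware refinement."""
    coarse = coarse_generate(audio_feats, landmarks)
    return refine(coarse, audio_to_lip(audio_feats))
```

The design point the sketch preserves is the division of labor: the diffusion stage commits to global appearance and pose, while the refinement stage only moves implicit keypoints, which is where the mouth-region blur the paper targets is corrected.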

Key words: Diffusion model, Implicit keypoint model, Audio-lip synchronization, Implicit space, Mapping network, Facial animation, Identity and behavior distortion

CLC Number: TP391
[1]YAN W B.An Audio-Driven Facial Animation Generation Model with Controllable Emotional Intensity[J].Information Technology and Informatization,2025(8):161-165.
[2]LIU L,LI H,ZHANG M,et al.A survey of deep learning-based facial animation driving methods[J].Journal of Xidian University,2025,52(2):57-84.
[3]HU Q H.Research on Key Technologies of Facial Animation Generation for Digital Humans[D].Chengdu:University of Electronic Science and Technology of China,2024.
[4]JI X J.Research on 3D Facial Animation Generation Based on Deep Learning[D].Hefei:University of Science and Technology of China,2023.
[5]LIU X M,LIU L,JIA D,et al.A Survey of 3D Facial Animation Technology Driven by Speech[J].Computer Systems & Applications,2022,31(10):44-50.
[6]DING N.The Performativity of Human Translators in the Era of Artificial General Intelligence:A Case Study of Conference Interpreters[J].Technology Enhanced Foreign Language Education,2025(2):10-16,98.
[7]WANG H W.Research on AR Interaction Design of Remote Office Meeting Platform from the Perspective of Embodiment[D].Nanjing:Nanjing University of Science and Technology,2022.
[8]XU H X,LIU L,WANG J,et al.An Augmented Reality-Enabled Motion Monitoring and Interaction System for Mobile Robots[J/OL].China Mechanical Engineering,1-10[2024-11-24].https://link.cnki.net/urlid/42.1294.th.20251111.1619.007.
[9]HU J W,HE H Y,LEI Y J,et al.An Augmented Reality Human-Robot Interaction Teleoperation System for Dual-Arm Collaborative Robots[J].Transducer and Microsystem Technologies,2025,44(11):87-92.
[10]ZHENG F,LIU X Y.The Application of Virtual Reality Technology in Film and Television Production[J].China Information Times,2025 (3):49-51.
[11]AN J.The Integration and Innovation of Cross-Media Art and Film Production in the Digital-Intelligence Era[N].Henan Economic Daily,2024-06-22(11).
[12]CHENG C,ZHAO Z K,DONG W J,et al.A Multimodal-Driven Facial Animation Generation Model with Controllable Emotion[J].Science Technology and Engineering,2025,25(28):12120-12129.
[14]DOU Z W,LI W S.Facial Animation Generation Based on Transformer[J].Software Engineering,2023,26(12):59-62.
[15]CAI G X.Audio2Face:Intelligently Generating Facial Animation for Virtual Characters from Audio Files[J].Modern Film Technology,2021(9):60-61.
[16]BLANZ V,VETTER T.Face recognition based on fitting a 3D morphable model[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2003,25(9):1063-1074.
[17]BOOTH J,ROUSSOS A,ZAFEIRIOU S,et al.A 3d morphable model learnt from 10,000 faces[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:5543-5552.
[18]GENOVA K,COLE F,MASCHINOT A,et al.Unsupervised training for 3d morphable model regression[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:8377-8386.
[19]ROMDHANI S,VETTER T.Efficient,robust and accurate fitting of a 3D morphable model[C]//Proceedings Ninth IEEE International Conference on Computer Vision.IEEE,2003:59-66.
[20]EGGER B,SMITH W A P,TEWARI A,et al.3d morphable face models-past,present,and future[J].ACM Transactions on Graphics,2020,39(5):1-38.
[21]REMONDINO F,KARAMI A,YAN Z,et al.A critical analysis of NeRF-based 3D reconstruction[J].Remote Sensing,2023,15(14):3585.
[22]WANG Z,WU S,XIE W,et al.NeRF--:Neural radiance fields without known camera parameters[J].arXiv:2102.07064,2021.
[23]YARIV L,GU J,KASTEN Y,et al.Volume rendering of neural implicit surfaces[J].Advances in Neural Information Processing Systems,2021,34:4805-4815.
[24]NIEMEYER M,MESCHEDER L,OECHSLE M,et al.Differentiable volumetric rendering:Learning implicit 3d representations without 3d supervision[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:3504-3515.
[25]TEWARI A,THIES J,MILDENHALL B,et al.Advances in neural rendering[C]//Computer Graphics Forum.2022:703-735.
[26]CHEN S,SUN P,SONG Y,et al.Diffusiondet:Diffusion model for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2023:19830-19843.
[27]BERRY F S,BERRY W D.Innovation and diffusion models in policy research[J].Theories of the Policy Process,2018:253-297.
[28]HO J,SALIMANS T,GRITSENKO A,et al.Video diffusion models[J].Advances in Neural Information Processing Systems,2022,35:8633-8646.
[29]WU L,SUN P,FU Y,et al.A neural influence diffusion model for social recommendation[C]//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval.2019:235-244.
[31]ZHU H,WU W,ZHU W,et al.CelebV-HQ:A large-scale video facial attributes dataset[C]//European Conference on Computer Vision.Cham:Springer,2022:650-667.
[32]WANG K,WU Q,SONG L,et al.Mead:A large-scale audio-visual dataset for emotional talking-face generation[C]//European Conference on Computer Vision.Cham:Springer,2020:700-717.
[33]SHAO M H,LU H R,WANG G D.A Speech-Driven Digital Human Face Generation Method Based on 4D Gaussian Splatting[EB/OL].https://link.cnki.net/urlid/50.1075.TP.20251205.1334.026.
[34]HUANG C X,LU T L,PENG S F.Research on Active Defense Against Face Forgery Based on Hybrid Color Space and Attention Mechanism[EB/OL].https://link.cnki.net/urlid/50.1075.tp.20251219.1739.045.
[35]CUDEIRO D,BOLKART T,LAIDLAW C,et al.Capture,learning,and synthesis of 3D speaking styles[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:10101-10111.
[36]FAN Y,LIN Z,SAITO J,et al.Faceformer:Speech-driven 3d facial animation with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:18770-18780.
[37]ZHANG W,CUN X,WANG X,et al.Sadtalker:Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:8652-8661.
[38]WEI H,YANG Z,WANG Z.Aniportrait:Audio-driven synthesis of photorealistic portrait animation[J].arXiv:2403.17694,2024.
[39]TAN S,JI B,BI M,et al.Edtalk:Efficient disentanglement for emotional talking head synthesis[C]//European Conference on Computer Vision.Cham:Springer,2024:398-416.
[40]CHEN Z,CAO J,CHEN Z,et al.Echomimic:Lifelike audio-driven portrait animations through editable landmark conditions[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2025:2403-2410.
[41]CUI J,LI H,YAO Y,et al.Hallo2:Long-duration and high-resolution audio-driven portrait image animation[J].arXiv:2410.07718,2024.
[42]LI W,ZHANG L,WANG D,et al.One-shot high-fidelity talking-head synthesis with deformable neural radiance field[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:17969-17978.
[43]MA Z,ZHU X,QI G J,et al.Otavatar:One-shot talking face avatar with controllable tri-plane rendering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:16901-16910.
[44]SUN J,WANG X,WANG L,et al.Next3d:Generative neural texture rasterization for 3d-aware head avatars[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:20991-21002.
[45]DENG Y,WANG D,REN X,et al.Portrait4d:Learning one-shot 4d head avatar synthesis using synthetic data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024:7119-7130.
[46]GUO J,ZHANG D,LIU X,et al.LivePortrait:Efficient Portrait Animation with Stitching and Retargeting Control[J].arXiv:2407.03168,2024.
[47]SUN R,FA Y W,FENG H D,et al.Research progress on face presentation attack detection method based on deep learning[J].Computer Science,2025,52(2):323-335.
[48]JIANG J,ZHANG Q,WANG C Y.A review of iris recognition based on deep learning[J].Computer Science and Exploration,2024,18(6):1421-1437.
[49]CHEN C F,XU Y F.Deepfakes in the Intelligent Era and Approaches to Their Governance[J].News and Writing,2020(4):66-71.
[50]XIE J C,TANG E S.The Social Harm and Governance Controversy of Deepfakes[J].News and Writing,2023(4):96-105.
[51]XIONG B.The Expansive Risks of Criminal Governance in Deepfake Technology and Its Limits[J].Journal of Anhui University (Philosophy and Social Science),2020(6):105-113.
[52]CHEN Z B,ZHANG L H.Legal Regulation of Algorithm Technology:Governance Dilemma,Development Logic,and Optimization Path[J].China Journal of Applied Jurisprudence,2024(4):155-166.
[53]ZHAN Q W.Review and Optimization of China’s Criminal Law Protection Model for Personality Rights[J].China Criminal Law Journal,2025(4):69-86.
[54]LI H S.Criminal Law Responses to Identity Fraud in the Digital Age[J].Jianghuai Forum,2024(3):124-132.
[55]LI H S.On the Criminal Responsibility for the Abuse of Personal Biometric Information:Taking Artificial Intelligence "Deepfake" as an Example[J].Tribune of Political Science and Law,2020,38(4):144-154.
[56]LIU Y H.The Iterative Upgrading of Cyber Crime to Digital Crime and the Response from Criminal Law[J].Journal of Comparative Law,2025(1):1-15.
[57]ZHAO B Z,ZHAN Q W.Reality Challenges and Future Prospects:A Reflection on Artificial Intelligence in Criminal Jurisprudence[J].Jinan Journal (Philosophy and Social Science),2019(1):98-110.
[58]SHEN W X.Reconstruction of the Digital Rights System:Toward a Pattern of Differential Order of Privacy,Information and Data[J].Tribune of Political Science and Law,2022,40(3):89-102.