Computer Science, 2025, Vol. 52, Issue (3): 58-67. doi: 10.11896/jsjkx.240300030

• 3D Vision and Metaverse •

Talking Portrait Synthesis Method Based on Regional Saliency and Spatial Feature Extraction

王邢波, 张浩, 高浩, 翟明亮, 谢九成   

  1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2024-03-05  Revised: 2024-10-08  Online: 2025-03-15  Published: 2025-03-07
  • Corresponding author: XIE Jiucheng (jiuchengxie@njupt.edu.cn)
  • About author: WANG Xingbo (sinbowang@163.com)
  • Supported by:
    National Natural Science Foundation of China (62301278, 62371254, 61931012) and Natural Science Foundation of Jiangsu Province (BK20230362, BK20210594)

Talking Portrait Synthesis Method Based on Regional Saliency and Spatial Feature Extraction

WANG Xingbo, ZHANG Hao, GAO Hao, ZHAI Mingliang, XIE Jiucheng   

  1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2024-03-05  Revised: 2024-10-08  Online: 2025-03-15  Published: 2025-03-07
  • About author: WANG Xingbo, born in 1975, Ph.D, lecturer. His main research interests include robot control and target tracking algorithms.
    XIE Jiucheng, born in 1992, Ph.D, lecturer. His main research interests include computer vision and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (62301278, 62371254, 61931012) and Natural Science Foundation of Jiangsu Province, China (BK20230362, BK20210594).

Abstract: Audio-driven talking portrait synthesis aims to convert arbitrary input audio sequences into realistic talking portrait videos. Recently, several talking portrait synthesis works based on neural radiance fields (NeRF) have achieved excellent visual results. However, such works still commonly suffer from poor audio-lip synchronization, torso jitter, and low clarity of the synthesized videos. To address these problems, a high-fidelity talking portrait synthesis method based on regional saliency features and spatial volume features is proposed. Specifically, on the one hand, a regional saliency-aware module is developed for head modeling. It uses multimodal input information to dynamically adjust the volumetric features of spatial points in the head region, while optimizing hash-table-based feature storage, thereby improving the precision of facial detail representation and the rendering efficiency. On the other hand, a spatial feature extraction module is designed for independent torso modeling. Unlike the approach commonly adopted by existing methods, which estimates color and density directly from the coordinates of torso surface points, this module uses reference images to construct a torso field that provides the corresponding texture and geometric priors, thereby achieving clearer torso rendering and natural torso motion. Experimental results on multiple subjects show that, in self-reconstruction scenarios, the proposed method improves image quality (PSNR, LPIPS, FID, LMD) by 10.15%, 12.12%, 0.77%, and 1.09%, respectively, and lip-sync accuracy (AUE) by 14.20% over the current state-of-the-art baseline. In addition, under cross-driven settings (with audio outside the training set), the method improves lip-sync accuracy (AUE) by 4.74%.
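To make the head-modeling idea above concrete, the following is a minimal, illustrative sketch: a hash-encoded spatial feature for each sampled head point is fused with an audio feature, a learned saliency weight re-scales the spatial feature, and a small MLP then predicts density and color. The class name, feature dimensions, and network sizes are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code): audio-conditioned saliency
# weighting of hash-encoded head features before density/colour prediction.
import torch
import torch.nn as nn

class RegionalSaliencyHead(nn.Module):
    def __init__(self, spatial_dim=32, audio_dim=64, hidden=64):
        super().__init__()
        # predicts a per-point saliency weight from the fused audio/spatial feature
        self.saliency = nn.Sequential(
            nn.Linear(spatial_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())
        # predicts density (1 channel) and RGB colour (3 channels)
        self.field = nn.Sequential(
            nn.Linear(spatial_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))

    def forward(self, h_spatial, h_audio):
        # h_spatial: (N, spatial_dim) hash-encoded features of N sampled head points
        # h_audio:   (N, audio_dim) audio feature broadcast to the same points
        fused = torch.cat([h_spatial, h_audio], dim=-1)
        w = self.saliency(fused)                       # (N, 1) regional saliency weight
        out = self.field(torch.cat([h_spatial * w, h_audio], dim=-1))
        sigma = torch.relu(out[..., :1])               # non-negative density
        rgb = torch.sigmoid(out[..., 1:])              # colour in [0, 1]
        return sigma, rgb
```

For example, `RegionalSaliencyHead()(torch.randn(1024, 32), torch.randn(1024, 64))` returns per-point density and color tensors of shapes (1024, 1) and (1024, 3); in a full NeRF pipeline these would then be composited along each camera ray by standard volume rendering.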

Key words: Talking portrait synthesis, 3D reconstruction, Audio-video synchronization, Neural radiance fields, Attention mechanism

Abstract: Audio-driven talking portrait synthesis endeavors to convert arbitrary input audio sequences into realistic talking portrait videos. Recently, several works on synthesizing talking portraits with neural radiance fields (NeRF) have achieved superior visual results. However, such works still generally suffer from poor audio-lip synchronization, torso jitter, and low clarity in the synthesized videos. To address these issues, a method based on regional saliency features and spatial volume features is proposed to achieve high-fidelity synthesis of talking portraits. On one hand, a regional saliency-aware module is developed for head modeling; it dynamically adjusts the volumetric features of spatial points in the head region using multimodal input data and optimizes feature storage through hash tables, thus improving the precision of facial detail representation and the rendering efficiency. On the other hand, a spatial feature extraction module is designed for independent torso modeling. Unlike conventional methods that estimate color and density directly from the coordinates of torso surface points, this module constructs a torso field from reference images to provide the corresponding texture and geometric priors, thereby achieving clearer torso rendering and more natural torso movements. Experiments on multiple subjects demonstrate that, in self-reconstruction scenarios, the proposed method improves image quality (PSNR, LPIPS, FID, LMD) by 10.15%, 12.12%, 0.77%, and 1.09%, respectively, and lip-sync accuracy (AUE) by 14.20% over the current state-of-the-art baseline model. Under cross-driving conditions with out-of-domain audio, lip-sync accuracy (AUE) improves by 4.74%.
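As a rough illustration of the torso branch described above, the sketch below samples a feature from a reference frame at each torso point's projected 2D location (serving as a texture/geometry prior) and concatenates it with the point's normalized image coordinates before predicting density and color. The lightweight encoder, the projection convention, and all names are hypothetical; the paper's actual torso field may differ.

```python
# Minimal sketch (assumed): torso density/colour conditioned on a reference-image
# feature rather than on raw 3D coordinates alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceTorsoField(nn.Module):
    def __init__(self, feat_ch=16, coord_dim=2, hidden=64):
        super().__init__()
        # lightweight encoder turning the reference frame into a feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1))
        self.mlp = nn.Sequential(
            nn.Linear(feat_ch + coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                      # density (1) + RGB (3)

    def forward(self, ref_img, coords):
        # ref_img: (1, 3, H, W) reference frame of the torso
        # coords:  (N, 2) torso points projected to normalised image coords in [-1, 1]
        feat_map = self.encoder(ref_img)                           # (1, C, H, W)
        grid = coords.view(1, -1, 1, 2)                            # (1, N, 1, 2)
        prior = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, N, 1)
        prior = prior.squeeze(-1).squeeze(0).t()                   # (N, C)
        out = self.mlp(torch.cat([prior, coords], dim=-1))
        sigma = torch.relu(out[..., :1])
        rgb = torch.sigmoid(out[..., 1:])
        return sigma, rgb
```

The design choice this sketch tries to capture is the one stated in the abstract: the reference-image feature carries appearance and layout information that the coordinates of a torso surface point alone cannot provide.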

Key words: Talking portrait synthesis, 3D reconstruction, Audio-video synchronization, Neural radiance fields, Attention mechanism

CLC number: TP391