Computer Science ›› 2025, Vol. 52 ›› Issue (3): 58-67. doi: 10.11896/jsjkx.240300030
WANG Xingbo, ZHANG Hao, GAO Hao, ZHAI Mingliang, XIE Jiucheng
Abstract: Audio-driven talking portrait synthesis aims to convert an arbitrary input audio sequence into a realistic talking portrait video. Recently, several talking portrait synthesis works based on neural radiance fields (NeRF) have achieved excellent visual quality. However, such methods still commonly suffer from poor audio-lip synchronization, torso jitter, and low clarity in the synthesized video. To address these problems, a high-fidelity talking portrait synthesis method based on region-salient features and spatial volume features is proposed. Specifically, on the one hand, a region saliency-aware module is developed for head modeling. It exploits multimodal input information to dynamically adjust the volume features of spatial points in the head region, while optimizing hash-table-based feature storage, thereby improving both the accuracy of facial detail representation and rendering efficiency. On the other hand, a spatial feature extraction module is designed to model the torso independently. Unlike the strategy commonly adopted by existing methods, which estimates color and density directly from the coordinates of spatial points on the torso surface, this module builds a torso field from a reference image that supplies the corresponding texture and geometry priors, yielding sharper torso rendering and more natural torso motion. Experiments on multiple subjects show that, in the self-reconstruction setting, the proposed method outperforms the current state-of-the-art baseline in image quality by 10.15%, 12.12%, 0.77%, and 1.09% on PSNR, LPIPS, FID, and LMD respectively, and improves lip synchronization accuracy (AUE) by 14.20%. In the cross-driven setting (driven by audio outside the training set), it further improves lip synchronization accuracy (AUE) by 4.74%.
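The abstract describes the two modules only at a high level; the minimal PyTorch sketch below is one illustrative reading of those ideas, not the authors' implementation. The class names (HashGrid, HeadField, TorsoField), all dimensions, the sigmoid gating used to "dynamically adjust" volume features from audio, and the CNN-plus-grid_sample scheme for extracting reference-image priors are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashGrid(nn.Module):
    """Toy multiresolution hash encoding (Instant-NGP style):
    nearest-vertex lookup only, no trilinear interpolation, for brevity."""
    def __init__(self, n_levels=4, table_size=2**14, feat_dim=2, base_res=16):
        super().__init__()
        self.n_levels, self.table_size, self.feat_dim = n_levels, table_size, feat_dim
        self.res = [base_res * 2 ** i for i in range(n_levels)]
        self.tables = nn.Parameter(torch.randn(n_levels, table_size, feat_dim) * 1e-2)

    def forward(self, x):                       # x: (N, 3) points in [0, 1]^3
        feats = []
        for lvl, res in enumerate(self.res):
            idx = (x * res).long()
            h = idx[:, 0] ^ (idx[:, 1] * 2654435761) ^ (idx[:, 2] * 805459861)
            feats.append(self.tables[lvl][h % self.table_size])
        return torch.cat(feats, dim=-1)         # (N, n_levels * feat_dim)

class HeadField(nn.Module):
    """Head branch: hash-grid volume features re-weighted per audio frame,
    an assumed stand-in for the paper's region saliency-aware module."""
    def __init__(self, audio_dim=64, hidden=64):
        super().__init__()
        self.grid = HashGrid()
        gdim = self.grid.n_levels * self.grid.feat_dim
        self.gate = nn.Sequential(nn.Linear(audio_dim, gdim), nn.Sigmoid())
        self.mlp = nn.Sequential(nn.Linear(gdim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))      # RGB + density

    def forward(self, x, audio):                # x: (N, 3), audio: (audio_dim,)
        f = self.grid(x) * self.gate(audio)     # audio-dependent feature adjustment
        rgb, sigma = self.mlp(f).split([3, 1], dim=-1)
        return torch.sigmoid(rgb), F.relu(sigma)

class TorsoField(nn.Module):
    """Torso branch: a reference image supplies texture/geometry priors
    instead of predicting color/density from point coordinates alone."""
    def __init__(self, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 16, 3, padding=1))
        self.mlp = nn.Sequential(nn.Linear(16 + 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, uv, ref):                 # uv: (N, 2) in [-1, 1], ref: (1, 3, H, W)
        fmap = self.enc(ref)                    # prior feature map from the reference
        f = F.grid_sample(fmap, uv.view(1, -1, 1, 2), align_corners=True)
        f = f.view(16, -1).t()                  # (N, 16) per-point prior features
        rgb, sigma = self.mlp(torch.cat([f, uv], dim=-1)).split([3, 1], dim=-1)
        return torch.sigmoid(rgb), F.relu(sigma)

# Smoke test with random inputs (all shapes are assumptions).
pts = torch.rand(1024, 3)                       # sampled head-region points
aud = torch.randn(64)                           # one frame of audio features
rgb_h, sig_h = HeadField()(pts, aud)
uv = torch.rand(1024, 2) * 2 - 1                # torso points projected onto the
ref = torch.randn(1, 3, 64, 64)                 # reference-image plane
rgb_t, sig_t = TorsoField()(uv, ref)
print(rgb_h.shape, sig_h.shape, rgb_t.shape, sig_t.shape)
```

In a full system the two fields would be composited along each camera ray by standard NeRF volume rendering; the sketch stops at the per-point color/density queries the abstract describes.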