Computer Science ›› 2025, Vol. 52 ›› Issue (3): 58-67.doi: 10.11896/jsjkx.240300030

• 3D Vision and Metaverse •

Talking Portrait Synthesis Method Based on Regional Saliency and Spatial Feature Extraction

WANG Xingbo, ZHANG Hao, GAO Hao, ZHAI Mingliang, XIE Jiucheng   

  1. College of Automation & College of Artificial Intelligence,Nanjing University of Posts and Telecommunications,Nanjing 210023,China
  • Received:2024-03-05 Revised:2024-10-08 Online:2025-03-15 Published:2025-03-07
  • About author:WANG Xingbo,born in 1975,Ph.D,lecturer.His main research interests include robot control and target tracking algorithms.
    XIE Jiucheng,born in 1992,Ph.D,lecturer.His main research interests include computer vision and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China(62301278,62371254,61931012)and Natural Science Foundation of Jiangsu Province,China(BK20230362,BK20210594).

Abstract: Audio-driven talking portrait synthesis aims to convert arbitrary input audio sequences into realistic talking portrait videos. Recently, several works that synthesize talking portraits with neural radiance fields (NeRF) have achieved superior visual results. However, such works still generally suffer from poor audio-lip synchronization, torso jitter, and low clarity in the synthesized videos. To address these issues, a method based on regional saliency features and spatial volume features is proposed to achieve high-fidelity synthesis of talking portraits. On one hand, a regional saliency-aware module is developed that dynamically adjusts the volumetric attributes of spatial points in the head region according to multimodal input data and optimizes feature storage through hash tables, thus improving the precision and efficiency of facial detail representation. On the other hand, a spatial feature extraction module is designed for independent torso modeling. Unlike conventional methods that estimate color and density directly from spatial points on the torso surface, this module constructs a torso field from reference images, providing relevant texture and geometric priors and thereby achieving more precise torso rendering and more natural movements. Experiments on multiple subjects demonstrate that, in self-reconstruction scenarios, the proposed method improves the image-quality metrics (PSNR, LPIPS, FID, LMD) by 10.15%, 12.12%, 0.77%, and 1.09%, respectively, over the current state-of-the-art baseline, and improves the AUE metric by 14.20%. Lip synchronization accuracy as measured by the Sync metric likewise improves by 14.20%. Under cross-driving conditions with out-of-domain audio sources, lip synchronization accuracy improves by 4.74%.
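The NeRF backbone that such methods build on predicts a density and color for each sampled spatial point and alpha-composites them along every camera ray. A minimal NumPy sketch of that compositing step is shown below; it is illustrative only (function and variable names are assumptions, not the paper's implementation):

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Alpha-composite per-sample densities and colors along one ray.

    densities: (N,) non-negative volume densities sigma_i
    colors:    (N, 3) RGB color predicted at each sample point
    deltas:    (N,) distances between adjacent samples along the ray
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance T_i: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas))[:-1])
    # Per-sample contribution weights, summing to at most 1
    weights = trans * alphas
    # Expected color of the ray
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

A fully opaque first sample dominates the ray: with a very large first density, the rendered color collapses to that sample's color, which is the behavior the head-region module exploits when it reweights volumetric attributes of salient points.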

Key words: Talking portrait synthesis, 3D reconstruction, Audio-video synchronization, Neural radiance fields, Attention mechanism

CLC Number: TP391
[1]CHUNG J S,JAMALUDIN A,ZISSERMAN A.You said that?[J].arXiv:1705.02966,2017.
[2]CRESWELL A,WHITE T,DUMOULIN V,et al.Generative adversarial networks:An overview[J].IEEE Signal Processing Magazine,2018,35(1):53-65.
[3]WANG Q Q,ZHANG J L.Face Pose and Expression Correction Based on 3D Morphable Model[J].Computer Science,2019,46(6):263-269.
[4]TANG Y X,WANG B J.Research Progress of Face Editing Based on Deep Generative Model[J].Computer Science,2022,49(2):51-61.
[5]MILDENHALL B,SRINIVASAN P P,TANCIK M,et al.Nerf:Representing scenes as neural radiance fields for view synthesis[J].Communications of the ACM,2021,65(1):99-106.
[6]GUO Y,CHEN K,LIANG S,et al.Ad-nerf:Audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Montreal:IEEE,2021:5784-5794.
[7]XIE Z F,ZHENG J H,WANG J,et al.Speech-Driven Facial Reenactment Guided by Structured Latent Codes in NeRF[J].Journal of Computer-Aided Design and Graphics,2023,41(3):1003-1015.
[8]ZHENG B W,DONG J W,WU L T,et al.A Method and System for Generating Virtual Anchors Based on Neural Radiance Fields and Hidden Attributes:CN-202311094348.7[P].2023-12-05.
[9]MULLER T,EVANS A,SCHIED C,et al.Instant neural graphics primitives with a multiresolution hash encoding[J].ACM Transactions on Graphics(ToG),2022,41(4):1-15.
[10]TANG J,WANG K,ZHOU H,et al.Real-time neural radiance talking portrait synthesis via audio-spatial decomposition[J].arXiv:2211.12368,2022.
[11]RONNEBERGER O,FISCHER P,BROX T.U-net:Convolutional networks for biomedical image segmentation[C]//Proceedings of the Medical Image Computing and Computer Assisted Intervention.Munich:MICCAI,2015:234-241.
[12]GU K,ZHOU Y,HUANG T.Flnet:Landmark driven fetching and learning network for faithful talking facial animation synthesis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.New York:AAAI,2020:10861-10868.
[13]ZHANG Z,LI L,DING Y,et al.Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Nashville:IEEE,2021:3661-3670.
[14]THIES J,ELGHARIB M,TEWARI A,et al.Neural voice puppetry:Audio-driven facial reenactment[C]//Proceedings of the European Conference on Computer Vision.ECCV,2020:716-731.
[15]BLANZ V,VETTER T.A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques.New York:ACM,1999:187-194.
[16]LIU X,XU Y,WU Q,et al.Semantic-aware implicit neural audio-driven video portrait generation[C]//Proceedings of the European Conference on Computer Vision.Switzerland:ECCV,2022:106-125.
[17]SHEN S,LI W,ZHU Z,et al.Learning dynamic facial radiance fields for few-shot talking head synthesis[C]//Proceedings of the European Conference on Computer Vision.Switzerland:ECCV,2022:666-682.
[18]YAO S,ZHONG R Z,YAN Y,et al.DFA-NeRF:Personalized talking head generation via disentangled face attributes neural rendering[J].arXiv:2201.00791,2022.
[19]CHAN E R,LIN C Z,CHAN M A,et al.Efficient geometry-aware 3D generative adversarial networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New Orleans:IEEE,2022:16123-16133.
[20]GUO M H,LIU Z N,MU T J,et al.Beyond self-attention:External attention using two linear layers for visual tasks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2022,45(5):5436-5447.
[21]LI J,ZHANG J,BAI X,et al.Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Paris:IEEE,2023:7568-7578.
[22]ZHOU H,SUN Y,WU W,et al.Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Nashville:IEEE,2021:4176-4186.
[23]ZHANG Z,HU Z,DENG W,et al.DINet:Deformation inpainting network for realistic face visually dubbing on high resolution video[C]//Proceedings of the AAAI Conference on Artificial Intelligence.Washington D.C:AAAI,2023:3543-3551.
[24]ZHANG R,ISOLA P,EFROS A,et al.The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:586-595.
[25]HEUSEL M,RAMSAUER H,UNTERTHINER T,et al.GANs trained by a two time-scale update rule converge to a local Nash equilibrium[J].Advances in Neural Information Processing Systems,2017,30(4):6626-6637.
[26]CHEN L,LI Z,MADDOX R K,et al.Lip movements generation at a glance[C]//Proceedings of the European Conference on Computer Vision.Salt Lake City:ECCV,2018:520-535.
[27]GUAN J,ZHANG Z,ZHOU H,et al.StyleSync:High-fidelity generalized and personalized lip sync in style-based generator[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.Vancouver:IEEE,2023:1505-1515.
[28]CHUNG J S,ZISSERMAN A.Lip reading in the wild[C]//Proceedings of the Asian Conference on Computer Vision.2017:87-103.
[29]BALTRUSAITIS T,ROBINSON P,MORENCY L P.Open-face:An open source facial behavior analysis toolkit[C]//Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision(WACV).Waikoloa:IEEE,2016:1-10.
[30]SUWAJANAKORN S,SEITZ S M,KEMELMACHER-SHLIZERMAN I.Synthesizing Obama:Learning lip sync from audio[J].ACM Transactions on Graphics(TOG),2017,36(4):1-13.