Computer Science ›› 2025, Vol. 52 ›› Issue (3): 58-67.doi: 10.11896/jsjkx.240300030

• 3D Vision and Metaverse •

Talking Portrait Synthesis Method Based on Regional Saliency and Spatial Feature Extraction

WANG Xingbo, ZHANG Hao, GAO Hao, ZHAI Mingliang, XIE Jiucheng   

  1. College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2024-03-05  Revised: 2024-10-08  Online: 2025-03-15  Published: 2025-03-07
  • About author: WANG Xingbo, born in 1975, Ph.D, lecturer. His main research interests include robot control and target tracking algorithms.
    XIE Jiucheng, born in 1992, Ph.D, lecturer. His main research interests include computer vision and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (62301278, 62371254, 61931012) and Natural Science Foundation of Jiangsu Province, China (BK20230362, BK20210594).

Abstract: Audio-driven talking portrait synthesis aims to convert arbitrary input audio sequences into realistic talking portrait videos. Recently, several works that synthesize talking portraits with neural radiance fields (NeRF) have achieved superior visual results. However, such works still generally suffer from poor audio-lip synchronization, torso jitter, and low clarity in the synthesized videos. To address these issues, a method based on regional saliency features and spatial volume features is proposed to achieve high-fidelity synthesis of talking portraits. On one hand, a regional saliency-aware module is developed that dynamically adjusts the volumetric attributes of spatial points in the head region according to multimodal input data and optimizes feature storage through hash tables, thus improving both the precision and the efficiency of facial detail representation. On the other hand, a spatial feature extraction module is designed for independent torso modeling. Unlike conventional methods, which estimate color and density directly from spatial points on the torso surface, this module constructs a torso field from reference images that supply texture and geometric priors, thereby achieving more precise torso rendering and more natural movements. Experiments on multiple subjects demonstrate that, in self-reconstruction scenarios, the proposed method improves the image-quality metrics PSNR, LPIPS, FID, and LMD by 10.15%, 12.12%, 0.77%, and 1.09%, respectively, over the current state-of-the-art baseline model, and enhances lip synchronization accuracy by 14.20% as measured by the AUE and Sync metrics. Under cross-driving conditions with out-of-domain audio sources, lip synchronization accuracy improves by 4.74%.
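For context, the "volumetric attributes" adjusted by the regional saliency-aware module are the per-point density and color of a neural radiance field. In the standard discrete volume rendering used by NeRF [5], the color of a camera ray $\mathbf{r}$ sampled at $N$ points with densities $\sigma_i$, colors $\mathbf{c}_i$, and inter-sample distances $\delta_i$ is

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$

so a module that modulates $\sigma_i$ and $\mathbf{c}_i$ for head-region points directly shapes the rendered facial detail.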
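The hash-table feature storage mentioned in the abstract follows the multiresolution hash encoding of Instant-NGP [9]. The sketch below (PyTorch) is a minimal, hypothetical illustration of that idea combined with an audio-conditioned saliency gate; it is not the authors' implementation, and every name and design detail in it (HashGridEncoder, SaliencyModulatedHead, the gating scheme, all dimensions) is an assumption made here for illustration only.

# Minimal, hypothetical sketch of hash-table feature storage in the spirit of
# Instant-NGP [9], with an audio-conditioned saliency gate standing in for the
# paper's regional saliency-aware module. Names, dimensions, and the gating
# scheme are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn

class HashGridEncoder(nn.Module):
    """Multiresolution hash encoding: each resolution level keeps a small
    trainable feature table; 3D points are hashed into it. Uses nearest-corner
    lookup for brevity (Instant-NGP interpolates the 8 surrounding corners)."""
    def __init__(self, n_levels=8, table_size=2**14, feat_dim=2,
                 base_res=16, growth=1.5):
        super().__init__()
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(table_size, feat_dim) * 1e-4)
             for _ in range(n_levels)])
        self.res = [int(base_res * growth ** l) for l in range(n_levels)]
        # Primes of the spatial hash used by Instant-NGP.
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))

    def forward(self, x):  # x: (B, 3) points in [0, 1]^3
        feats = []
        for table, res in zip(self.tables, self.res):
            idx = (x * res).long()                        # voxel corner (floor)
            h = (idx * self.primes).sum(-1) % table.shape[0]
            feats.append(table[h])                        # (B, feat_dim)
        return torch.cat(feats, dim=-1)                   # (B, n_levels*feat_dim)

class SaliencyModulatedHead(nn.Module):
    """Hypothetical head branch: an audio feature yields a per-point gate that
    rescales the hash-grid features before the MLP predicting density/color,
    so audio-salient regions (e.g. the mouth) receive sharper detail."""
    def __init__(self, enc, audio_dim=64):
        super().__init__()
        self.enc = enc
        d = len(enc.res) * enc.tables[0].shape[1]
        self.gate = nn.Sequential(nn.Linear(d + audio_dim, d), nn.Sigmoid())
        self.mlp = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, x, audio):  # audio: (B, audio_dim) per-point feature
        f = self.enc(x)
        f = f * self.gate(torch.cat([f, audio], dim=-1))  # saliency gating
        out = self.mlp(f)
        return out[:, :1], torch.sigmoid(out[:, 1:])      # sigma, rgb

pts = torch.rand(1024, 3)      # sampled ray points
aud = torch.randn(1024, 64)    # audio features broadcast to points (assumed)
sigma, rgb = SaliencyModulatedHead(HashGridEncoder())(pts, aud)

The torso branch described in the abstract would be an analogous field conditioned on features extracted from reference images rather than audio; since the abstract does not detail its architecture, it is not sketched here.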

Key words: Talking portrait synthesis, 3D reconstruction, Audio-video synchronization, Neural radiance fields, Attention mechanism

CLC Number: TP391

References:
[1] CHUNG J S, JAMALUDIN A, ZISSERMAN A. You said that?[J]. arXiv:1705.02966, 2017.
[2] CRESWELL A, WHITE T, DUMOULIN V, et al. Generative adversarial networks: An overview[J]. IEEE Signal Processing Magazine, 2018, 35(1): 53-65.
[3] WANG Q Q, ZHANG J L. Face Pose and Expression Correction Based on 3D Morphable Model[J]. Computer Science, 2019, 46(6): 263-269.
[4] TANG Y X, WANG B J. Research Progress of Face Editing Based on Deep Generative Model[J]. Computer Science, 2022, 49(2): 51-61.
[5] MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106.
[6] GUO Y, CHEN K, LIANG S, et al. AD-NeRF: Audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 5784-5794.
[7] XIE Z F, ZHENG J H, WANG J, et al. Speech-Driven Facial Reenactment Guided by Structured Latent Codes in NeRF[J]. Journal of Computer-Aided Design and Graphics, 2023, 41(3): 1003-1015.
[8] ZHENG B W, DONG J W, WU L T, et al. A Method and System for Generating Virtual Anchors Based on Neural Radiance Fields and Hidden Attributes: CN202311094348.7[P]. 2023-12-05.
[9] MULLER T, EVANS A, SCHIED C, et al. Instant neural graphics primitives with a multiresolution hash encoding[J]. ACM Transactions on Graphics (TOG), 2022, 41(4): 1-15.
[10] TANG J, WANG K, ZHOU H, et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition[J]. arXiv:2211.12368, 2022.
[11] RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]//Proceedings of Medical Image Computing and Computer-Assisted Intervention. Munich: MICCAI, 2015: 234-241.
[12] GU K, ZHOU Y, HUANG T. FLNet: Landmark driven fetching and learning network for faithful talking facial animation synthesis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020: 10861-10868.
[13] ZHANG Z, LI L, DING Y, et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3661-3670.
[14] THIES J, ELGHARIB M, TEWARI A, et al. Neural voice puppetry: Audio-driven facial reenactment[C]//Proceedings of the European Conference on Computer Vision. ECCV, 2020: 716-731.
[15] BLANZ V, VETTER T. A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. New York: ACM, 1999: 187-194.
[16] LIU X, XU Y, WU Q, et al. Semantic-aware implicit neural audio-driven video portrait generation[C]//Proceedings of the European Conference on Computer Vision. Switzerland: ECCV, 2022: 106-125.
[17] SHEN S, LI W, ZHU Z, et al. Learning dynamic facial radiance fields for few-shot talking head synthesis[C]//Proceedings of the European Conference on Computer Vision. Switzerland: ECCV, 2022: 666-682.
[18] YAO S, ZHONG R Z, YAN Y, et al. DFA-NeRF: Personalized talking head generation via disentangled face attributes neural rendering[J]. arXiv:2201.00791, 2022.
[19] CHAN E R, LIN C Z, CHAN M A, et al. Efficient geometry-aware 3D generative adversarial networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16123-16133.
[20] GUO M H, LIU Z N, MU T J, et al. Beyond self-attention: External attention using two linear layers for visual tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(5): 5436-5447.
[21] LI J, ZHANG J, BAI X, et al. Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 7568-7578.
[22] ZHOU H, SUN Y, WU W, et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 4176-4186.
[23] ZHANG Z, HU Z, DENG W, et al. DINet: Deformation inpainting network for realistic face visually dubbing on high resolution video[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Washington D.C.: AAAI, 2023: 3543-3551.
[24] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 586-595.
[25] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[J]. Advances in Neural Information Processing Systems, 2017, 30: 6626-6637.
[26] CHEN L, LI Z, MADDOX R K, et al. Lip movements generation at a glance[C]//Proceedings of the European Conference on Computer Vision. Munich: ECCV, 2018: 520-535.
[27] GUAN J, ZHANG Z, ZHOU H, et al. StyleSync: High-fidelity generalized and personalized lip sync in style-based generator[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 1505-1515.
[28] CHUNG J S, ZISSERMAN A. Lip reading in the wild[C]//Proceedings of the Asian Conference on Computer Vision. Taipei: Springer, 2017: 87-103.
[29] BALTRUSAITIS T, ROBINSON P, MORENCY L P. OpenFace: An open source facial behavior analysis toolkit[C]//Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Placid: IEEE, 2016: 1-10.
[30] SUWAJANAKORN S, SEITZ S M, KEMELMACHER-SHLIZERMAN I. Synthesizing Obama: Learning lip sync from audio[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1-13.