Computer Science ›› 2023, Vol. 50 ›› Issue (8): 68-78.doi: 10.11896/jsjkx.221000031

• Computer Graphics & Multimedia •

Review of Talking Face Generation

SONG Xinyang1,2, YAN Zhiyuan3, SUN Muyi2, DAI Linlin3, LI Qi1,2, SUN Zhenan1,2   

    1 School of Artificial Intelligence,University of Chinese Academy of Sciences,Beijing 100049,China
    2 Center for Research on Intelligent Perception and Computing,National Laboratory of Pattern Recognition,Institute of Automation,Chinese Academy of Sciences,Beijing 100190,China
    3 Institute of Computing Technology,China Academy of Railway Sciences Corporation Limited,Beijing 100081,China
  • Received:2022-10-07 Revised:2023-02-20 Online:2023-08-15 Published:2023-08-02
  • About author:SONG Xinyang,born in 2000,Ph.D. candidate.His main research interests include human behavior generation.
    DAI Linlin,born in 1983,postgraduate,senior engineer.Her main research interests include computer vision technology.
  • Supported by:
    China National Railway Group Co., Ltd. Science and Technology Research Project(N2021X026).

Abstract: Talking face generation is a popular research direction in the field of visual generation, which aims to generate realistic speaker videos from multimodal input data. Talking face generation has broad application prospects in video media, game animation and Internet-related industries, and it can also provide data support for research on tasks such as lip-reading recognition, forgery detection and digital human generation. Existing mainstream methods can already achieve talking face generation with personalized attributes and audio-visual synchronization, but they fail to meet the requirements of emerging application scenarios such as virtual reality, human-computer interaction and the metaverse, so the study of talking face generation is of great significance for promoting the development of related industries. This paper sorts out and summarizes the research status of talking face generation. First, it elaborates the research background and related technologies of talking face generation; then it introduces the mainstream generation methods of recent years by category and surveys the audio-visual datasets and evaluation metrics commonly used in this research; finally, it summarizes the problems of existing methods and analyzes potential future research directions for talking face generation.
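To make the evaluation metrics mentioned in the abstract concrete, the sketch below is an illustrative example, not code from the paper: it computes a simplified, single-window variant of SSIM [70] in Python/NumPy (the reference metric averages SSIM over local sliding windows), and the ground-truth and generated frames are hypothetical stand-ins.

    import numpy as np

    def global_ssim(x, y, data_range=255.0):
        """Simplified single-window SSIM between two equally shaped grayscale frames."""
        c1 = (0.01 * data_range) ** 2  # stabilizing constants from Wang et al.[70]
        c2 = (0.03 * data_range) ** 2
        x = np.asarray(x, dtype=np.float64)
        y = np.asarray(y, dtype=np.float64)
        mu_x, mu_y = x.mean(), y.mean()
        var_x, var_y = x.var(), y.var()
        cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # cross-covariance of the two frames
        return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
            (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        gt = rng.integers(0, 256, size=(256, 256))                   # stand-in ground-truth frame
        gen = np.clip(gt + rng.normal(0.0, 10.0, gt.shape), 0, 255)  # noisy "generated" frame
        print(f"global SSIM: {global_ssim(gt, gen):.4f}")            # identical frames score 1.0

For reporting results, the sliding-window implementation in scikit-image (skimage.metrics.structural_similarity) or standard FID [71] and LPIPS [72] toolkits would normally be used instead of this approximation.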

Key words: Face generation, Video generation, Image generation, Deep learning, Multimodal learning, Face reconstruction, Deep fake, Computer vision

CLC Number: TP391

[1]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[J].Advances in Neural Information Processing Systems,2014,27:2672-2680.
[2]BLANZ V,VETTER T.A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques.1999:187-194.
[3]PAYSAN P,KNOTHE R,AMBERG B,et al.A 3D face model for pose and illumination invariant face recognition[C]//2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance.IEEE,2009:296-301.
[4]GERIG T,MOREL-FORSTER A,BLUMER C,et al.Morphable face models-an open framework[C]//2018 13th IEEE International Conference on Automatic Face & Gesture Recognition.IEEE,2018:75-82.
[5]TRAN A T,HASSNER T,MASI I,et al.Regressing robust and discriminative 3D morphable models with a very deep neural network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2017:1493-1502.
[6]TEWARI A,ZOLLHÖFER M,KIM H,et al.Mofa:Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction[C]//The IEEE International Conference on Computer Vision.2017:5.
[7]ZHU X,LEI Z,LIU X,et al.Face alignment across large poses:A 3d solution[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:146-155.
[8]FENG Y,WU F,SHAO X,et al.Joint 3D face reconstruction and dense alignment with position map regression network[J].arXiv:1803.07835,2018.
[9]COOTES T F,TAYLOR C J,COOPER D H,et al.Active shape models-their training and application[J].Computer Vision and Image Understanding,1995,61(1):38-59.
[10]COOTES T F,EDWARDS G J,TAYLOR C J.Active appearance models[C]//European Conference on Computer Vision.Berlin:Springer,1998:484-498.
[11]DOLLÁR P,WELINDER P,PERONA P.Cascaded pose regression[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2010:1078-1085.
[12]SUN Y,WANG X,TANG X.Deep convolutional network cascade for facial point detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2013:3476-3483.
[13]ZHOU H,LIU Y,LIU Z,et al.Talking face generation by adversarially disentangled audio-visual representation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:9299-9306.
[14]ZAKHAROV E,SHYSHEYA A,BURKOV E,et al.Few-shot adversarial learning of realistic neural talking head models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9459-9468.
[15]ISOLA P,ZHU J Y,ZHOU T,et al.Image-to-image translation with conditional adversarial networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:1125-1134.
[16]YIN F,ZHANG Y,CUN X,et al.StyleHEAT:One-shot high-resolution editable talking face generation via pretrained StyleGAN[J].arXiv:2203.04036,2022.
[17]SUWAJANAKORN S,SEITZ S M,KEMELMACHER-SHLIZERMAN I.Synthesizing Obama:learning lip sync from audio[J].ACM Transactions on Graphics,2017,36(4):1-13.
[18]CHEN L,MADDOX R K,DUAN Z,et al.Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:7832-7841.
[19]DAS D,BISWAS S,SINHA S,et al.Speech-driven facial animation using cascaded gans for learning of motion and texture[C]//European Conference on Computer Vision.Cham:Springer,2020:408-424.
[20]ZHENG A,ZHU F,ZHU H,et al.Talking face generation via learning semantic and temporal synchronous landmarks[C]//2020 25th International Conference on Pattern Recognition.IEEE,2021:3682-3689.
[21]ZHOU Y,HAN X,SHECHTMAN E,et al.MakeItTalk:speaker-aware talking-head animation[J].ACM Transactions on Graphics,2020,39(6):1-15.
[22]ZAKHAROV E,SHYSHEYA A,BURKOV E,et al.Few-shot adversarial learning of realistic neural talking head models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9459-9468.
[23]ZAKHAROV E,IVAKHNENKO A,SHYSHEYA A,et al.Fast bi-layer neural synthesis of one-shot realistic head avatars[C]//European Conference on Computer Vision.Cham:Springer,2020:524-540.
[24]GU K,ZHOU Y,HUANG T.Flnet:Landmark driven fetching and learning network for faithful talking facial animation synthesis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:10861-10868.
[25]ZENG D,LIU H,LIN H,et al.Talking face generation with expression-tailored generative adversarial network[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:1716-1724.
[26]ZHANG X,WU X.Multi-modality deep restoration of extremely compressed face videos[J].arXiv:2107.05548,2021.
[27]RICHARD A,ZOLLHÖFER M,WEN Y,et al.Meshtalk:3d face animation from speech using cross-modality disentanglement[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:1173-1182.
[28]LAHIRI A,KWATRA V,FRUEH C,et al.LipSync3D:Data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:2755-2764.
[29]KIM H,GARRIDO P,TEWARI A,et al.Deep video portraits[J].ACM Transactions on Graphics,2018,37(4):1-14.
[30]KIM H,ELGHARIB M,ZOLLHÖFER M,et al.Neural style-preserving visual dubbing[J].ACM Transactions on Graphics,2019,38(6):1-13.
[31]REN Y,LI G,CHEN Y,et al.PIRenderer:Controllable portrait image generation via semantic neural rendering[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:13759-13768.
[32]CHEN L,CUI G,LIU C,et al.Talking-head generation with rhythmic head motion[C]//European Conference on Computer Vision.Cham:Springer,2020:35-51.
[33]SONG L,WU W,QIAN C,et al.Everybody's talkin':Let me talk as you want[J].IEEE Transactions on Information Forensics and Security,2022,17:585-598.
[34]YI R,YE Z,ZHANG J,et al.Audio-driven talking face video generation with learning-based personalized head pose[J].arXiv:2002.10137,2020.
[35]ZHANG C,ZHAO Y,HUANG Y,et al.Facial:Synthesizing dynamic talking face with implicit attribute learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:3867-3876.
[36]JI X,ZHOU H,WANG K,et al.Audio-driven emotional video portraits[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:14080-14089.
[37]SIAROHIN A,LATHUILIÈRE S,TULYAKOV S,et al.Animating arbitrary objects via deep motion transfer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:2377-2386.
[38]SIAROHIN A,LATHUILIÈRE S,TULYAKOV S,et al.First order motion model for image animation[J].Advances in Neural Information Processing Systems,2019,32:7137-7147.
[39]WANG T C,MALLYA A,LIU M Y.One-shot free-view neural talking-head synthesis for video conferencing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:10039-10049.
[40]WANG S,LI L,DING Y,et al.Audio2Head:Audio-driven one-shot talking-head generation with natural head motion[J].arXiv:2107.09293,2021.
[41]WANG S,LI L,DING Y,et al.One-shot talking face generation from single-speaker audio-visual correlation learning[J].arXiv:2112.02749,2021.
[42]ZHANG Z,LI L,DING Y,et al.Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:3661-3670.
[43]DOUKAS M C,ZAFEIRIOU S,SHARMANSKA V.HeadGAN:One-shot neural head synthesis and editing[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:14398-14407.
[44]CHEN L,LI Z,MADDOX R K,et al.Lip movements generation at a glance[C]//Proceedings of the European Conference on Computer Vision.2018:520-535.
[45]JAMALUDIN A,CHUNG J S,ZISSERMAN A.You said that:Synthesising talking faces from audio[J].International Journal of Computer Vision,2019,127(11):1767-1779.
[46]PRAJWAL K R,MUKHOPADHYAY R,NAMBOODIRI V P,et al.A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:484-492.
[47]VOUGIOUKAS K,PETRIDIS S,PANTIC M.Realistic speech-driven facial animation with gans[J].International Journal of Computer Vision,2020,128(5):1398-1413.
[48]SONG Y,ZHU J,LI D,et al.Talking face generation by conditional recurrent adversarial network[J].arXiv:1804.04786,2018.
[49]ZHOU H,SUN Y,WU W,et al.Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:4176-4186.
[50]BURKOV E,PASECHNIK I,GRIGOREV A,et al.Neural head reenactment with latent pose descriptors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:13786-13795.
[51]ZHU H,HUANG H,LI Y,et al.Arbitrary talking face generation via attentional audio-visual coherence learning[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence.2021:2362-2368.
[52]ZHANG J,LIU L,XUE Z,et al.Apb2face:audio-guided face reenactment with auxiliary pose and blink signals[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:4402-4406.
[53]ZHANG J,ZENG X,XU C,et al.APB2FaceV2:Real-time audio-guided multi-face reenactment[J].arXiv:2010.13017,2020.
[54]LU Y,CHAI J,CAO X.Live speech portraits:real-time photorealistic talking-head animation[J].ACM Transactions on Graphics,2021,40(6):1-17.
[55]WANG T C,LIU M Y,ZHU J Y,et al.Video-to-video synthesis[J].Advances in Neural Information Processing Systems,2018,31:1144-1156.
[56]WANG T C,LIU M Y,TAO A,et al.Few-shot video-to-video synthesis[J].Advances in Neural Information Processing Systems,2019,32:5013-5024.
[57]WANG Y,YANG D,BREMOND F,et al.Latent image animator:learning to animate images via latent space navigation[C]//International Conference on Learning Representations.2021.
[58]WILES O,KOEPKE A,ZISSERMAN A.X2face:A network for controlling face generation using images,audio,and pose codes[C]//Proceedings of the European Conference on Computer Vision.2018:670-686.
[59]WU W,ZHANG Y,LI C,et al.Reenactgan:Learning to reenact faces via boundary transfer[C]//Proceedings of the European Conference on Computer Vision.2018:603-619.
[60]SONG L,WU W,FU C,et al.Everything's talkin':Pareidolia face reenactment[J].arXiv:2104.03061,2021.
[61]MILDENHALL B,SRINIVASAN P P,TANCIK M,et al.Nerf:Representing scenes as neural radiance fields for view synthesis[C]//European Conference on Computer Vision.Cham:Springer,2020:405-421.
[62]GAFNI G,THIES J,ZOLLHOFER M,et al.Dynamic neural radiance fields for monocular 4d facial avatar reconstruction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:8649-8658.
[63]GUO Y,CHEN K,LIANG S,et al.Ad-nerf:Audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:5784-5794.
[64]ALGHAMDI N,MADDOCK S,MARXER R,et al.A corpus of audio-visual Lombard speech with frontal and profile views[J].The Journal of the Acoustical Society of America,2018,143(6):EL523-EL529.
[65]CHUNG J S,ZISSERMAN A.Lip reading in the wild[C]//Asian Conference on Computer Vision.Cham:Springer,2016:87-103.
[66]CHUNG J S,SENIOR A,VINYALS O,et al.Lip reading sentences in the wild[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2017:3444-3453.
[67]NAGRANI A,CHUNG J S,ZISSERMAN A.Voxceleb:a large-scale speaker identification dataset[J].arXiv:1706.08612,2017.
[68]CHUNG J S,NAGRANI A,ZISSERMAN A.Voxceleb2:Deep speaker recognition[J].arXiv:1806.05622,2018.
[69]WANG K,WU Q,SONG L,et al.Mead:A large-scale audio-visual dataset for emotional talking-face generation[C]//European Conference on Computer Vision.Cham:Springer,2020:700-717.
[70]WANG Z,BOVIK A C,SHEIKH H R,et al.Image quality assessment:from error visibility to structural similarity[J].IEEE Transactions on Image Processing,2004,13(4):600-612.
[71]HEUSEL M,RAMSAUER H,UNTERTHINER T,et al.Gans trained by a two time-scale update rule converge to a local Nash equilibrium[J].Advances in Neural Information Processing Systems,2017,30:6626-6637.
[72]ZHANG R,ISOLA P,EFROS A A,et al.The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:586-595.

Related Articles:
[1] ZHANG Yian, YANG Ying, REN Gang, WANG Gang. Study on Multimodal Online Reviews Helpfulness Prediction Based on Attention Mechanism [J]. Computer Science, 2023, 50(8): 37-44.
[2] WANG Xu, WU Yanxia, ZHANG Xue, HONG Ruize, LI Guangsheng. Survey of Rotating Object Detection Research in Computer Vision [J]. Computer Science, 2023, 50(8): 79-92.
[3] ZHOU Ziyi, XIONG Hailing. Image Captioning Optimization Strategy Based on Deep Learning [J]. Computer Science, 2023, 50(8): 99-110.
[4] ZHANG Xiao, DONG Hongbin. Lightweight Multi-view Stereo Integrating Coarse Cost Volume and Bilateral Grid [J]. Computer Science, 2023, 50(8): 125-132.
[5] WANG Yu, WANG Zuchao, PAN Rui. Survey of DGA Domain Name Detection Based on Character Feature [J]. Computer Science, 2023, 50(8): 251-259.
[6] LI Kun, GUO Wei, ZHANG Fan, DU Jiayu, YANG Meiyue. Adversarial Malware Generation Method Based on Genetic Algorithm [J]. Computer Science, 2023, 50(7): 325-331.
[7] WANG Mingxia, XIONG Yun. Disease Diagnosis Prediction Algorithm Based on Contrastive Learning [J]. Computer Science, 2023, 50(7): 46-52.
[8] SHEN Zhehui, WANG Kailai, KONG Xiangjie. Exploring Station Spatio-Temporal Mobility Pattern:A Short and Long-term Traffic Prediction Framework [J]. Computer Science, 2023, 50(7): 98-106.
[9] HUO Weile, JING Tao, REN Shuang. Review of 3D Object Detection for Autonomous Driving [J]. Computer Science, 2023, 50(7): 107-118.
[10] ZHOU Bo, JIANG Peifeng, DUAN Chang, LUO Yuetong. Study on Single Background Object Detection Oriented Improved-RetinaNet Model and Its Application [J]. Computer Science, 2023, 50(7): 137-142.
[11] MAO Huihui, ZHAO Xiaole, DU Shengdong, TENG Fei, LI Tianrui. Short-term Subway Passenger Flow Forecasting Based on Graphical Embedding of Temporal Knowledge [J]. Computer Science, 2023, 50(7): 213-220.
[12] LI Yuqiang, LI Linfeng, ZHU Hao, HOU Mengshu. Deep Learning-based Algorithm for Active IPv6 Address Prediction [J]. Computer Science, 2023, 50(7): 261-269.
[13] LIANG Mingxuan, WANG Shi, ZHU Junwu, LI Yang, GAO Xiang, JIAO Zhixiang. Survey of Knowledge-enhanced Natural Language Generation Research [J]. Computer Science, 2023, 50(6A): 220200120-8.
[14] WANG Dongli, YANG Shan, OUYANG Wanli, LI Baopu, ZHOU Yan. Explainability of Artificial Intelligence:Development and Application [J]. Computer Science, 2023, 50(6A): 220600212-7.
[15] GAO Xiang, WANG Shi, ZHU Junwu, LIANG Mingxuan, LI Yang, JIAO Zhixiang. Overview of Named Entity Recognition Tasks [J]. Computer Science, 2023, 50(6A): 220200119-8.