计算机科学 ›› 2023, Vol. 50 ›› Issue (8): 68-78.doi: 10.11896/jsjkx.221000031

• 计算机图形学&多媒体 •

说话人生成研究现状与发展趋势

宋昕洋1,2, 阎志远3, 孙沐毅2, 戴琳琳3, 李琦1,2, 孙哲南1,2   

    1 中国科学院大学人工智能学院 北京 100049
    2 中国科学院自动化研究所模式识别国家重点实验室智能感知与计算研究中心 北京 100190
    3 中国铁道科学研究院集团有限公司电子计算技术研究所 北京 100081
  • 收稿日期:2022-10-07 修回日期:2023-02-20 出版日期:2023-08-15 发布日期:2023-08-02
  • 通讯作者: 戴琳琳(daizi2407@163.com)
  • 作者简介: 宋昕洋(songxinyang2022@ia.ac.cn)
  • 基金资助:
    中国国家铁路集团有限公司科技研究开发计划课题(N2021X026)

Review of Talking Face Generation

SONG Xinyang1,2, YAN Zhiyuan3, SUN Muyi2, DAI Linlin3, LI Qi1,2, SUN Zhenan1,2   

    1 School of Artificial Intelligence,University of Chinese Academy of Sciences,Beijing 100049,China
    2 Center for Research on Intelligent Perception and Computing,National Laboratory of Pattern Recognition,Institute of Automation,Chinese Academy of Sciences,Beijing 100190,China
    3 Institute of Computing Technology,China Academy of Railway Sciences Corporation Limited,Beijing 100081,China
  • Received:2022-10-07 Revised:2023-02-20 Online:2023-08-15 Published:2023-08-02
  • About author:SONG Xinyang,born in 2000,Ph.D. candidate.His main research interests include human behavior generation and so on.
    DAI Linlin,born in 1983,postgraduate,senior engineer.Her main research interests include computer vision technology and so on.
  • Supported by:
    China National Railway Group Co., Ltd. Science and Technology Research Project(N2021X026).

摘要: 说话人生成是视觉生成领域的热门研究方向,旨在根据输入的多模态信息生成逼真的说话人视频。说话人生成在影视传媒、游戏动漫和互联网相关产业中具有广阔的应用前景,同时也可以为唇读识别、伪造鉴别和数字人生成等任务的研究提供数据支持。现阶段主流的说话人生成方法已经能够实现包含个性化属性、视听同步的说话人视频生成,但还未能达到虚拟现实、人机交互和元宇宙等新兴应用场景的要求。因此,研究说话人生成对于推动相关产业发展具有重要意义。文中对说话人生成的研究现状进行了梳理与总结:首先阐述了说话人生成的研究背景和相关技术,然后根据方法分类介绍了近年来主流的说话人生成方法,并整理了相关研究中常用的视听数据集和评价指标,最后总结了现有方法存在的问题,分析了说话人生成未来潜在的研究方向。

关键词: 人脸生成, 视频生成, 图像生成, 深度学习, 多模态学习, 人脸重建, 深度伪造, 计算机视觉

Abstract: Talking face generation is a popular research direction in the field of visual generation, which aims to generate realistic talking face videos from multimodal input data. Talking face generation has broad application prospects in video media, game animation and Internet-related industries, and it can also provide data support for research on tasks such as lip reading recognition, forgery identification and digital human generation. Existing mainstream methods are already able to generate talking face videos with personalized attributes and audio-visual synchronization, but they still fail to meet the requirements of emerging application scenarios such as virtual reality, human-computer interaction and the metaverse, so the study of talking face generation is of great significance for promoting the development of related industries. This paper sorts out and summarizes the research status of talking face generation. It first elaborates the research background and related technologies of talking face generation, then introduces the mainstream generation methods of recent years according to a method-based classification and sorts out the audio-visual datasets and evaluation metrics commonly used in related research, and finally summarizes the problems of existing methods and analyzes potential future research directions of talking face generation.

Key words: Face generation, Video generation, Image generation, Deep learning, Multimodal learning, Face reconstruction, Deep fake, Computer vision

中图分类号: TP391
[1]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[J].Advances in Neural Information Processing Systems,2014,27:2672-2680.
[2]BLANZ V,VETTER T.A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques.1999:187-194.
[3]PAYSAN P,KNOTHE R,AMBERG B,et al.A 3D face model for pose and illumination invariant face recognition[C]//2009 sixth IEEE International Conference on Advanced Video and Signal Based Surveillance.IEEE,2009:296-301.
[4]GERIG T,MOREL-FORSTER A,BLUMER C,et al.Morphable face models-an open framework[C]//2018 13th IEEE International Conference on Automatic Face & Gesture Recognition.IEEE,2018:75-82.
[5]TRAN A T,HASSNER T,MASI I,et al.Regressing robust and discriminative 3D morphable models with a very deep neural network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2017:1493-1502.
[6]TEWARI A,ZOLLHÖFER M,KIM H,et al.MoFA:Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction[C]//The IEEE International Conference on Computer Vision.2017:5.
[7]ZHU X,LEI Z,LIU X,et al.Face alignment across large poses:A 3D solution[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:146-155.
[8]FENG Y,WU F,SHAO X,et al.Joint 3D face reconstruction and dense alignment with position map regression network[J].arXiv:1803.07835,2018.
[9]COOTES T F,TAYLOR C J,COOPER D H,et al.Active shape models-their training and application[J].Computer Vision and Image Understanding,1995,61(1):38-59.
[10]COOTES T F,EDWARDS G J,TAYLOR C J.Active appearance models[C]//European Conference on Computer Vision.Berlin:Springer,1998:484-498.
[11]DOLLÁR P,WELINDER P,PERONA P.Cascaded pose regression[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2010:1078-1085.
[12]SUN Y,WANG X,TANG X.Deep convolutional network cascade for facial point detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2013:3476-3483.
[13]ZHOU H,LIU Y,LIU Z,et al.Talking face generation by adversarially disentangled audio-visual representation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:9299-9306.
[14]ZAKHAROV E,SHYSHEYA A,BURKOV E,et al.Few-shot adversarial learning of realistic neural talking head models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9459-9468.
[15]ISOLA P,ZHU J Y,ZHOU T,et al.Image-to-image translation with conditional adversarial networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:1125-1134.
[16]YIN F,ZHANG Y,CUN X,et al.StyleHEAT:One-shot high-resolution editable talking face generation via pretrained StyleGAN[J].arXiv:2203.04036,2022.
[17]SUWAJANAKORN S,SEITZ S M,KEMELMACHER-SHLIZERMAN I.Synthesizing Obama:learning lip sync from audio[J].ACM Transactions on Graphics,2017,36(4):1-13.
[18]CHEN L,MADDOX R K,DUAN Z,et al.Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:7832-7841.
[19]DAS D,BISWAS S,SINHA S,et al.Speech-driven facial animation using cascaded GANs for learning of motion and texture[C]//European Conference on Computer Vision.Cham:Springer,2020:408-424.
[20]ZHENG A,ZHU F,ZHU H,et al.Talking face generation via learning semantic and temporal synchronous landmarks[C]//2020 25th International Conference on Pattern Recognition.IEEE,2021:3682-3689.
[21]ZHOU Y,HAN X,SHECHTMAN E,et al.MakeItTalk:speaker-aware talking-head animation[J].ACM Transactions on Graphics,2020,39(6):1-15.
[22]ZAKHAROV E,SHYSHEYA A,BURKOV E,et al.Few-shot adversarial learning of realistic neural talking head models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:9459-9468.
[23]ZAKHAROV E,IVAKHNENKO A,SHYSHEYA A,et al.Fast bi-layer neural synthesis of one-shot realistic head avatars[C]//European Conference on Computer Vision.Cham:Springer,2020:524-540.
[24]GU K,ZHOU Y,HUANG T.FLNet:Landmark driven fetching and learning network for faithful talking facial animation synthesis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:10861-10868.
[25]ZENG D,LIU H,LIN H,et al.Talking face generation with expression-tailored generative adversarial network[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:1716-1724.
[26]ZHANG X,WU X.Multi-modality deep restoration of extremely compressed face videos[J].arXiv:2107.05548,2021.
[27]RICHARD A,ZOLLHÖFER M,WEN Y,et al.MeshTalk:3D face animation from speech using cross-modality disentanglement[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:1173-1182.
[28]LAHIRI A,KWATRA V,FRUEH C,et al.LipSync3D:Data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:2755-2764.
[29]KIM H,GARRIDO P,TEWARI A,et al.Deep video portraits[J].ACM Transactions on Graphics,2018,37(4):1-14.
[30]KIM H,ELGHARIB M,ZOLLHÖFER M,et al.Neural style-preserving visual dubbing[J].ACM Transactions on Graphics,2019,38(6):1-13.
[31]REN Y,LI G,CHEN Y,et al.PIRenderer:Controllable portrait image generation via semantic neural rendering[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:13759-13768.
[32]CHEN L,CUI G,LIU C,et al.Talking-head generation with rhythmic head motion[C]//European Conference on Computer Vision.Cham:Springer,2020:35-51.
[33]SONG L,WU W,QIAN C,et al.Everybody's talkin':Let me talk as you want[J].IEEE Transactions on Information Forensics and Security,2022,17:585-598.
[34]YI R,YE Z,ZHANG J,et al.Audio-driven talking face video generation with learning-based personalized head pose[J].arXiv:2002.10137,2020.
[35]ZHANG C,ZHAO Y,HUANG Y,et al.FACIAL:Synthesizing dynamic talking face with implicit attribute learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:3867-3876.
[36]JI X,ZHOU H,WANG K,et al.Audio-driven emotional video portraits[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:14080-14089.
[37]SIAROHIN A,LATHUILIÈRE S,TULYAKOV S,et al.Animating arbitrary objects via deep motion transfer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:2377-2386.
[38]SIAROHIN A,LATHUILIÈRE S,TULYAKOV S,et al.First order motion model for image animation[J].Advances in Neural Information Processing Systems,2019,32:7137-7147.
[39]WANG T C,MALLYA A,LIU M Y.One-shot free-view neural talking-head synthesis for video conferencing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:10039-10049.
[40]WANG S,LI L,DING Y,et al.Audio2Head:Audio-driven one-shot talking-head generation with natural head motion[J].arXiv:2107.09293,2021.
[41]WANG S,LI L,DING Y,et al.One-shot talking face generation from single-speaker audio-visual correlation learning[J].arXiv:2112.02749,2021.
[42]ZHANG Z,LI L,DING Y,et al.Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:3661-3670.
[43]DOUKAS M C,ZAFEIRIOU S,SHARMANSKA V.HeadGAN:One-shot neural head synthesis and editing[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:14398-14407.
[44]CHEN L,LI Z,MADDOX R K,et al.Lip movements generation at a glance[C]//Proceedings of the European Conference on Computer Vision.2018:520-535.
[45]JAMALUDIN A,CHUNG J S,ZISSERMAN A.You said that:Synthesising talking faces from audio[J].International Journal of Computer Vision,2019,127(11):1767-1779.
[46]PRAJWAL K R,MUKHOPADHYAY R,NAMBOODIRI V P,et al.A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:484-492.
[47]VOUGIOUKAS K,PETRIDIS S,PANTIC M.Realistic speech-driven facial animation with GANs[J].International Journal of Computer Vision,2020,128(5):1398-1413.
[48]SONG Y,ZHU J,LI D,et al.Talking face generation by conditional recurrent adversarial network[J].arXiv:1804.04786,2018.
[49]ZHOU H,SUN Y,WU W,et al.Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:4176-4186.
[50]BURKOV E,PASECHNIK I,GRIGOREV A,et al.Neural head reenactment with latent pose descriptors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:13786-13795.
[51]ZHU H,HUANG H,LI Y,et al.Arbitrary talking face generation via attentional audio-visual coherence learning[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence.2021:2362-2368.
[52]ZHANG J,LIU L,XUE Z,et al.APB2Face:Audio-guided face reenactment with auxiliary pose and blink signals[C]//IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:4402-4406.
[53]ZHANG J,ZENG X,XU C,et al.APB2FaceV2:Real-time audio-guided multi-face reenactment[J].arXiv:2010.13017,2020.
[54]LU Y,CHAI J,CAO X.Live speech portraits:real-time photorealistic talking-head animation[J].ACM Transactions on Graphics,2021,40(6):1-17.
[55]WANG T C,LIU M Y,ZHU J Y,et al.Video-to-video synthesis[J].Advances in Neural Information Processing Systems,2018,31:1144-1156.
[56]WANG T C,LIU M Y,TAO A,et al.Few-shot video-to-video synthesis[J].Advances in Neural Information Processing Systems,2019,32:5013-5024.
[57]WANG Y,YANG D,BREMOND F,et al.Latent image animator:learning to animate images via latent space navigation[C]//International Conference on Learning Representations.2022.
[58]WILES O,KOEPKE A,ZISSERMAN A.X2face:A network for controlling face generation using images,audio,and pose codes[C]//Proceedings of the European Conference on Computer Vision.2018:670-686.
[59]WU W,ZHANG Y,LI C,et al.ReenactGAN:Learning to reenact faces via boundary transfer[C]//Proceedings of the European Conference on Computer Vision.2018:603-619.
[60]SONG L,WU W,FU C,et al.Everything's talkin':Pareidolia face reenactment[J].arXiv:2104.03061,2021.
[61]MILDENHALL B,SRINIVASAN P P,TANCIK M,et al.NeRF:Representing scenes as neural radiance fields for view synthesis[C]//European Conference on Computer Vision.Cham:Springer,2020:405-421.
[62]GAFNI G,THIES J,ZOLLHÖFER M,et al.Dynamic neural radiance fields for monocular 4D facial avatar reconstruction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:8649-8658.
[63]GUO Y,CHEN K,LIANG S,et al.AD-NeRF:Audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:5784-5794.
[64]ALGHAMDI N,MADDOCK S,MARXER R,et al.A corpus of audio-visual Lombard speech with frontal and profile views[J].The Journal of the Acoustical Society of America,2018,143(6):EL523-EL529.
[65]CHUNG J S,ZISSERMAN A.Lip reading in the wild[C]//Asian Conference on Computer Vision.Cham:Springer,2016:87-103.
[66]CHUNG J S,SENIOR A,VINYALS O,et al.Lip reading sentences in the wild[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2017:3444-3453.
[67]NAGRANI A,CHUNG J S,ZISSERMAN A.VoxCeleb:a large-scale speaker identification dataset[J].arXiv:1706.08612,2017.
[68]CHUNG J S,NAGRANI A,ZISSERMAN A.VoxCeleb2:Deep speaker recognition[J].arXiv:1806.05622,2018.
[69]WANG K,WU Q,SONG L,et al.MEAD:A large-scale audio-visual dataset for emotional talking-face generation[C]//European Conference on Computer Vision.Cham:Springer,2020:700-717.
[70]WANG Z,BOVIK A C,SHEIKH H R,et al.Image quality assessment:from error visibility to structural similarity[J].IEEE Transactions on Image Processing,2004,13(4):600-612.
[71]HEUSEL M,RAMSAUER H,UNTERTHINER T,et al.GANs trained by a two time-scale update rule converge to a local Nash equilibrium[J].Advances in Neural Information Processing Systems,2017,30:6626-6637.
[72]ZHANG R,ISOLA P,EFROS A A,et al.The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:586-595.