Computer Science (计算机科学), 2023, Vol. 50, Issue 8: 68-78. doi: 10.11896/jsjkx.221000031
宋昕洋1,2, 阎志远3, 孙沐毅2, 戴琳琳3, 李琦1,2, 孙哲南1,2
SONG Xinyang1,2, YAN Zhiyuan3, SUN Muyi2, DAI Linlin3, LI Qi1,2, SUN Zhenan1,2
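Abstract: Talking face generation is a popular research direction in visual generation that aims to synthesize realistic talking face videos from multimodal input. It has broad application prospects in film and television media, games and animation, and Internet-related industries, and it can also provide data support for research on tasks such as lip reading, forgery detection, and digital human generation. Current mainstream methods can already generate talking face videos that carry personalized attributes and are audio-visually synchronized, but they do not yet meet the requirements of emerging application scenarios such as virtual reality, human-computer interaction, and the metaverse. Research on talking face generation is therefore important for advancing the related industries. This paper reviews and summarizes the state of research on talking face generation. It first describes the research background and related techniques, then introduces the mainstream methods of recent years by category, and collates the audio-visual datasets and evaluation metrics commonly used in related work. Finally, it summarizes the problems of existing methods and analyzes potential future research directions.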