Computer Science (计算机科学) ›› 2026, Vol. 53 ›› Issue (2): 245-252. doi: 10.11896/jsjkx.241200067
GUO Xingxing1,2, XIAO Yannan1,2, WEN Peizhi1,2,3, XU Zhi1,2, HUANG Wenming1,2,3
Abstract: A key difficulty in audio-driven talking-face video generation is aligning information from two different modalities, audio and video, so as to achieve lip-audio synchronization. Most existing techniques are developed on English datasets; because Mandarin pronunciation differs from English pronunciation, applying them directly to Chinese audio-driven talking-face generation produces blurred teeth and insufficient video clarity. Building on the GAN framework, this paper proposes M-CSAWav2Lip, an attention-based method for audio-driven talking-face video generation. Audio features are extracted by fusing MFCC and Mel-spectrogram representations, exploiting the temporal dynamics of MFCC and the frequency resolution of the Mel spectrogram to comprehensively capture fine-grained variations in speech. During face generation, a network architecture based on attention mechanisms and residual connections is adopted: weighted channel and spatial attention strengthen important features and improve the extraction of key audio and video features, so that Chinese audio-visual information is effectively encoded and fused to generate lip movements and facial video matching the speech content. Finally, the method is trained and tested on a self-built Chinese dataset and on public datasets. Experimental results show that the lip-synchronized talking-face videos generated by the proposed method improve in both accuracy and quality.