Computer Science ›› 2026, Vol. 53 ›› Issue (2): 245-252. doi: 10.11896/jsjkx.241200067

• Computer Graphics & Multimedia •


Attention-based Audio-driven Digital Face Video Generation Method

GUO Xingxing1,2, XIAO Yannan1,2, WEN Peizhi1,2,3, XU Zhi1,2, HUANG Wenming1,2,3   

  1 School of Computer and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    2 Guangxi Key Laboratory of Image and Graphics Intelligent Processing, Guilin, Guangxi 541004, China
    3 School of Information Engineering, Guilin University of Information Technology, Guilin, Guangxi 541004, China
  • Received: 2024-12-09 Revised: 2025-03-08 Online: 2026-02-10
  • Corresponding author: XIAO Yannan (xiaoyn@guet.edu.cn)
  • About author: GUO Xingxing, born in 1998, postgraduate (guoxingxing0328@163.com). Her main research interest is digital image processing.
    XIAO Yannan, born in 1990, postgraduate, engineer. His main research interests include artificial intelligence and image-based 3D reconstruction.
  • Supported by:
Guangxi Key Laboratory of Image and Graphic Intelligent Processing Foundation Project (GIIP2310) and Guangxi Natural Science Foundation, China (2020GXNSFAA297186).



Abstract: The key challenge in audio-driven digital face video generation lies in aligning information from two different modalities, audio and video, to achieve lip synchronization. Existing methods have been developed primarily on English datasets; because Chinese and English pronunciation differ, applying them directly to Chinese audio-driven face video generation produces blurred teeth and insufficient video clarity. This paper proposes M-CSAWav2Lip, an attention-based audio-driven digital face video generation method built on a GAN framework. The method fuses MFCC and Mel spectrogram features for audio feature extraction: the temporal dynamics of MFCCs and the frequency resolution of the Mel spectrogram together capture subtle variations in the speech signal. During digital face generation, a network architecture based on attention mechanisms and residual connections is employed. Weighted channel and spatial attention strengthen important features, improving the extraction of key audio and video cues, so that Chinese audio-visual information is effectively encoded and fused to generate lip movements and facial video consistent with the speech content. Finally, the model is trained and tested on a self-built Chinese dataset and a public benchmark dataset. Experimental results show that the generated lip-synced digital face videos improve in both accuracy and quality.
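
To make the audio front end concrete, below is a minimal sketch of fusing MFCC and Mel spectrogram features as the abstract describes. It assumes librosa and a 16 kHz mono clip; the function name and the n_fft, hop_length, n_mels, and n_mfcc values are illustrative choices, not values reported by the authors.

```python
# A minimal sketch of the fused audio representation outlined in the
# abstract: MFCC (temporal dynamics) + Mel spectrogram (frequency
# resolution). Window/hop/bin counts below are assumptions, not the
# paper's reported settings.
import numpy as np
import librosa

def fused_audio_features(wav_path, sr=16000, n_mels=80, n_mfcc=13):
    y, _ = librosa.load(wav_path, sr=sr)
    # Mel spectrogram: fine-grained frequency content of the speech.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=800, hop_length=200, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # MFCCs: compact coefficients whose frame-to-frame changes track
    # articulation over time.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc, n_fft=800, hop_length=200)
    # Both share the same time axis, so fusion is a feature-axis stack.
    return np.concatenate([log_mel, mfcc], axis=0)  # (n_mels + n_mfcc, T)
```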
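
Likewise, here is a minimal PyTorch sketch of a weighted channel-and-spatial attention block with a residual connection, in the spirit of the architecture the abstract outlines. The CBAM-style pooling, the reduction ratio, and the learnable fusion weights alpha/beta are assumptions for illustration, not the authors' exact design.

```python
# A sketch of weighted channel + spatial attention with a residual
# connection; layer sizes and the learnable weights are assumptions.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight per channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Learnable scalars balancing the two maps (the "weighted" part).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        ca = x * (self.alpha * self.channel_mlp(x))
        avg_map = torch.mean(ca, dim=1, keepdim=True)
        max_map, _ = torch.max(ca, dim=1, keepdim=True)
        attn = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        sa = ca * (self.beta * attn)
        return x + sa  # residual connection preserves identity features
```

One plausible placement is inside the encoder/decoder stages of a Wav2Lip-style generator, though the abstract does not specify the exact insertion points.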

Key words: Audio-driven, Lip synchronization, Audio feature extraction, Digital face generation, Attention mechanism

CLC Number: TP391