Computer Science ›› 2026, Vol. 53 ›› Issue (2): 245-252.doi: 10.11896/jsjkx.241200067

• Computer Graphics & Multimedia •

Attention-based Audio-driven Digital Face Video Generation Method

GUO Xingxing1,2, XIAO Yannan1,2, WEN Peizhi1,2,3, XU Zhi1,2, HUANG Wenming1,2,3   

  1. School of Computer and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    2. Guangxi Key Laboratory of Image and Graphics Intelligent Processing, Guilin, Guangxi 541004, China
    3. School of Information Engineering, Guilin University of Information Technology, Guilin, Guangxi 541004, China
  • Received: 2024-12-09  Revised: 2025-03-08  Published: 2026-02-10
  • About author: GUO Xingxing, born in 1998, postgraduate. Her main research interest is digital image processing.
    XIAO Yannan, born in 1990, postgraduate, engineer. His main research interests include artificial intelligence and image-based 3D reconstruction.
  • Supported by:
    Guangxi Key Laboratory of Image and Graphic Intelligent Processing Foundation Project (GIIP2310) and Guangxi Natural Science Foundation, China (2020GXNSFAA297186).

Abstract: The key challenge in audio-driven digital face video generation is aligning information from two different modalities, audio and video, to achieve lip synchronization. Existing methods have been developed primarily on English datasets; owing to the phonetic differences between Chinese and English, applying them directly to Chinese audio-driven face video generation produces artifacts such as blurred teeth and insufficient video clarity. This paper proposes M-CSAWav2Lip, an audio-driven digital face video generation method built on a GAN framework and enhanced with an attention mechanism. The method combines MFCC and Mel spectrogram features for audio feature extraction: the temporal dynamics captured by MFCC and the frequency resolution of the Mel spectrogram together cover subtle variations in the speech signal. During face generation, a network architecture based on attention mechanisms and residual connections is employed. Weighted channel and spatial attention mechanisms emphasize important features, improving the extraction of key audio and video features, so that Chinese audio-video information can be effectively encoded and fused to generate lip movements and facial videos consistent with the audio content. Finally, the model is trained and tested on both a custom Chinese dataset and a general dataset. Experimental results demonstrate that the generated lip-synced digital face videos improve in both accuracy and quality.
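As a rough illustration of the combined audio representation described above, the sketch below computes both a log-Mel spectrogram and MFCC features in plain NumPy. The frame length, hop size, filter count, and coefficient count are illustrative assumptions, not the paper's actual settings, and the synthetic tone merely stands in for real speech.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            if c > l:
                fb[i - 1, k] = (k - l) / (c - l)   # rising slope
        for k in range(c, r):
            if r > c:
                fb[i - 1, k] = (r - k) / (r - c)   # falling slope
    return fb

def audio_features(y, sr=16000, n_fft=512, hop=160, n_mels=80, n_mfcc=13):
    # Windowed STFT power spectrogram, one row per frame
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[t * hop : t * hop + n_fft] * np.hanning(n_fft)
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # (T, n_fft//2+1)
    # Log-Mel spectrogram: the frequency-resolution branch
    mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # MFCC: DCT-II of the log-Mel energies, keeping low-order coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[:, None] + 0.5) * np.arange(n_mfcc)[None, :])
    mfcc = mel @ dct                                         # (T, n_mfcc)
    return mfcc, mel

# A 0.5 s synthetic 440 Hz tone stands in for real speech
sr = 16000
t = np.arange(sr // 2) / sr
mfcc, mel = audio_features(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

In practice a library such as librosa would compute both representations; the point here is only that the two branches share one STFT and differ in whether the DCT decorrelation step is applied.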

Key words: Audio-driven, Lip synchronization, Audio feature extraction, Digital face generation, Attention mechanism
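The weighted channel and spatial attention with a residual connection mentioned in the abstract can be sketched as below. This is a minimal, parameter-free NumPy illustration: the gates here use raw avg/max pooling and a sigmoid, whereas real attention modules (and presumably the paper's) pass the pooled statistics through learned layers, and the branch weights `alpha`/`beta` are hypothetical constants standing in for learned weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    # Squeeze spatial dims (avg + max pooling), gate each channel
    gate = sigmoid(x.mean(axis=(1, 2)) + x.max(axis=(1, 2)))   # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    # Pool across channels, gate each spatial location
    gate = sigmoid(x.mean(axis=0) + x.max(axis=0))             # (H, W)
    return x * gate[None, :, :]

def weighted_attention_block(x, alpha=0.6, beta=0.4):
    # Residual connection plus a weighted sum of the two attention
    # branches; alpha/beta stand in for the learned branch weights
    return x + alpha * channel_attention(x) + beta * spatial_attention(x)

feat = np.random.default_rng(0).standard_normal((4, 8, 8))     # (C, H, W)
out = weighted_attention_block(feat)
```

The residual path keeps the original features intact while the two gated branches re-weight them, which is what lets the network emphasize lip-relevant channels and mouth-region locations without losing the rest of the face.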

CLC Number: TP391
[1]WANG J,QIAN X,ZHANG M,et al.Seeing what you said:Talking face generation guided by a lip reading expert[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023:14653-14662.
[2]CHEN L,MADDOX R K,DUAN Z,et al.Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:7832-7841.
[3]PRAJWAL K R,MUKHOPADHYAY R,NAMBOODIRI V P,et al.A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:484-492.
[4]ZHOU H,SUN Y,WU W,et al.Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:4176-4186.
[5]GUO Y,CHEN K,LIANG S,et al.AD-NeRF:Audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:5784-5794.
[6]MUKHOPADHYAY S,SURI S,GADDE R T,et al.Diff2lip:Audio conditioned diffusion models for lip-synchronization[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.2024:5292-5302.
[7]MISTRY D S,KULKARNI A V.Overview:Speech Recognition Technology,Mel-Frequency Cepstral Coefficients(MFCC),Artificial Neural Network(ANN)[J/OL].https://www.ijert.org/research/overview-speech-recognition-technology-mel-frequency-cepstral-coefficients-mfcc-artificial-neural-network-ann-IJERTV2IS100586.pdf.
[8]TRAN T,LUNDGREN J.Drill Fault Diagnosis Based on the Scalogram and Mel Spectrogram of Sound Signals Using Artificial Intelligence[J].IEEE Access,2020,8:203655-203666.
[9]LI H,QIU K,CHEN L,et al.SCAttNet:Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images[J].IEEE Geoscience and Remote Sensing Letters,2020,18(5):905-909.
[10]QIN Z,ZHANG P,WU F,et al.FcaNet:Frequency channel attention networks[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:783-792.
[11]GUO M H,XU T X,LIU J J,et al.Attention mechanisms in computer vision:A survey[J].Computational Visual Media,2022,8(3):331-368.
[12]CHUNG J S,ZISSERMAN A.Out of time:automated lip sync in the wild[C]//Computer Vision-ACCV 2016 Workshops:ACCV 2016 International Workshops.2017:251-263.
[13]JI Y,YU Y Q.Optimization algorithm for speech facial video generation based on dense convolutional generative adversarial networks and keyframes[J].Journal of Jilin University(Engineering and Technology Edition),2025,55(3):986-992.
[14]AFOURAS T,CHUNG J S,SENIOR A,et al.Deep audio-visual speech recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,44(12):8717-8727.
[15]SON C J,SENIOR A,VINYALS O,et al.Lip reading sentences in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:6447-6456.
[16]CHUNG J,ZISSERMAN A.Lip reading in profile[C]//British Machine Vision Conference.British Machine Vision Association and Society for Pattern Recognition,2017.
[17]ZHAO Y,XU R,SONG M.A cascade sequence-to-sequence model for chinese mandarin lip reading[C]//Proceedings of the 1st ACM International Conference on Multimedia in Asia.2019:1-6.
[18]ZHAO Y,XU R,WANG X,et al.Hearing lips:Improving lip reading by distilling speech recognizers[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:6917-6924.
[19]PARK S J,KIM M,HONG J,et al.Synctalkface:Talking face generation with precise lip-syncing via audio-lip memory[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2022:2062-2070.
[20]LIANG B,PAN Y,GUO Z,et al.Expressive talking head generation with granular audio-visual control[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:3387-3396.
[21]DUCHI J,HAZAN E,SINGER Y.Adaptive subgradient methods for online learning and stochastic optimization[J].Journal of Machine Learning Research,2011,12(7):2121-2159.