Speech Enhancement Based on Time-Frequency Domain GAN

Computer Science, 2022, Vol. 49, Issue (6): 187-192. doi: 10.11896/jsjkx.210500114

• Computer Graphics & Multimedia •

YIN Wen-bing¹, GAO Ge¹, ZENG Bang¹, WANG Xiao¹, CHEN Yi²

1 National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan 430072, China
2 School of Computer Science, Central China Normal University, Wuhan 430077, China

Received: 2021-05-17 Revised: 2021-09-04 Publication date: 2022-06-15 Online release: 2022-06-08
Corresponding author: GAO Ge (gaoge@whu.edu.cn)
About the authors: (912228963@qq.com)
YIN Wen-bing, born in 1997, postgraduate. His main research interests include speech enhancement.
GAO Ge, born in 1973, Ph.D., professor, is a member of China Computer Federation. His main research interests include speech processing and computer vision.

Abstract

The conventional speech enhancement generative adversarial network (SEGAN) enhances speech in the time domain and completely ignores the distribution of speech samples in the frequency domain. At low signal-to-noise ratios, the speech signal is submerged in noise and the time-domain distribution of the noisy speech is difficult to capture. As a result, the enhancement performance of SEGAN drops sharply, and the quality and intelligibility of its enhanced speech are very low. To address this problem, this paper proposes a speech enhancement algorithm based on a time-frequency domain generative adversarial network (time-frequency domain SEGAN, TFSEGAN). TFSEGAN adopts a dual-discriminator model structure, with one discriminator in the time domain and one in the frequency domain, together with a time-frequency domain L1 loss function. The time-domain discriminator takes the time-domain features of a speech sample as input, and the frequency-domain discriminator takes its frequency-domain features. During training, the time-domain discriminator uses the time-domain distribution of the speech samples as its criterion, while the frequency-domain discriminator uses their frequency-domain distribution. Guided by the two discriminators, the generator of TFSEGAN learns the distribution of speech samples in the time domain and the frequency domain simultaneously. Experiments show that, at low signal-to-noise ratios, TFSEGAN improves speech quality and intelligibility by about 17.45% and 11.75%, respectively, compared with SEGAN.

Key words: Generative adversarial network, Low signal-to-noise ratio, Speech enhancement, Speech intelligibility, Speech quality, Time-frequency domain
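The abstract names a time-frequency domain L1 loss but does not give its exact form. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the loss is a weighted sum of a mean absolute error on waveform samples and a mean absolute error on short-time magnitude spectra. The function names, the Hann window, the frame length, the hop size, and the weights `alpha` and `beta` are all hypothetical choices made for illustration.

```python
import cmath
import math


def frame_magnitudes(signal, frame_len=8, hop=4):
    """Magnitude spectrum of each Hann-windowed frame (naive DFT, illustration only)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Keep only the non-redundant bins 0..frame_len/2 of a real-valued signal.
        mags = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame)))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(mags)
    return frames


def mean_abs_diff(a, b):
    """L1 distance between two equal-length sequences, averaged over elements."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def time_frequency_l1(enhanced, clean, alpha=1.0, beta=1.0):
    """Assumed form: alpha * time-domain L1 + beta * magnitude-spectrum L1."""
    time_term = mean_abs_diff(enhanced, clean)
    enh_spec = [m for frame in frame_magnitudes(enhanced) for m in frame]
    cln_spec = [m for frame in frame_magnitudes(clean) for m in frame]
    freq_term = mean_abs_diff(enh_spec, cln_spec)
    return alpha * time_term + beta * freq_term


# Toy data: a clean sinusoid and a copy with a small alternating perturbation.
clean = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
noisy = [x + 0.1 * (-1) ** n for n, x in enumerate(clean)]
```

Calling `time_frequency_l1(noisy, clean)` returns a positive value, while identical inputs give zero. In a GAN setting, a term of this kind would typically be added to the generator's adversarial loss, so that the generator is penalized for deviations from the clean speech in both domains.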

CLC Number: TN912.35

Cite this article:

YIN Wen-bing, GAO Ge, ZENG Bang, WANG Xiao, CHEN Yi. Speech Enhancement Based on Time-Frequency Domain GAN[J]. Computer Science, 2022, 49(6): 187-192. https://doi.org/10.11896/jsjkx.210500114

