Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230200203-9. doi: 10.11896/jsjkx.230200203

• Image Processing & Multimedia Technology •


Speech Enhancement Based on Generative Adversarial Networks with Gated Recurrent Units and Self-attention Mechanisms

ZHANG Dehui1, DONG Anming1,2, YU Jiguo1,2, ZHAO Kai3 and ZHOU You4

  1 School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
    2 Big Data Research Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
    3 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    4 Shandong HiCon New Media Institute Co., Ltd., Jinan 250013, China
  • Published: 2023-11-09
  • Corresponding author: DONG Anming (anmingdong@qlu.edu.cn)
  • About author: ZHANG Dehui, born in 1997, master, is a student member of China Computer Federation. His main research interests include deep learning and speech enhancement.
    DONG Anming, born in 1982, Ph.D., associate professor, postgraduate supervisor, is a member of China Computer Federation. His main research interests include time-series signal processing, wireless communication and artificial intelligence.
  • Supported by:
    National Key Research and Development Program of China (2019YFB2102600), National Natural Science Foundation of China (62272256), Innovation Capability Enhancement Program for Small and Medium-sized Technological Enterprises of Shandong Province (2022TSGC2180, 2022TSGC2123), Independently Cultivated Innovation Team of Jinan "Twenty Measures for Universities" Program (202228093), and Pilot Program for the Integration of Science, Education and Industry (Fundamental Research) of Qilu University of Technology (Shandong Academy of Sciences) (2022XD001).


Abstract: Owing to the adversarial training between its two component networks, which continually strengthens the mapping ability of the generator, the generative adversarial network (GAN) has strong noise-reduction capability and has been applied to speech enhancement in recent years. Existing GAN-based speech enhancement methods, however, do not make full use of the temporal and global dependencies in speech feature sequences. To address this shortcoming, this paper proposes a speech enhancement GAN that integrates gated recurrent units (GRU) and a self-attention mechanism. The network builds a time-modeling module in both serial and parallel arrangements to capture the temporal dependencies and contextual information of speech feature sequences. Compared with the baseline algorithm, the proposed GAN improves the perceptual evaluation of speech quality (PESQ) score by 4% and performs better on several other objective metrics, including segmental signal-to-noise ratio (SSNR) and short-time objective intelligibility (STOI). The results show that fusing the temporal and global correlations in speech feature sequences helps to improve the performance of GAN-based speech enhancement.
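
This page does not reproduce the network's exact configuration, so the following is only a minimal PyTorch sketch of the idea the abstract describes: a time-modeling block that combines a GRU branch (temporal dependencies) with a self-attention branch (global dependencies) in either a serial or a parallel arrangement. The layer sizes, the additive fusion, and the residual connection are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' released code) of a time-modeling block
# fusing a GRU with self-attention, as described in the abstract. Dimensions,
# the additive fusion rule and the serial/parallel switch are assumptions.
import torch
import torch.nn as nn

class TimeModelingBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, mode: str = "parallel"):
        super().__init__()
        assert mode in ("serial", "parallel")
        self.mode = mode
        # GRU models local temporal dependencies along the frame axis.
        self.gru = nn.GRU(dim, dim, batch_first=True)
        # Self-attention models global (long-range) dependencies.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) sequence of speech features.
        if self.mode == "serial":
            h, _ = self.gru(x)           # temporal modeling first ...
            out, _ = self.attn(h, h, h)  # ... then global attention on top
        else:                            # parallel: both branches see x
            h, _ = self.gru(x)
            a, _ = self.attn(x, x, x)
            out = h + a                  # fuse by summation (assumption)
        return self.norm(out + x)        # residual connection (assumption)

# Usage on a dummy feature sequence:
feats = torch.randn(2, 100, 256)         # (batch, frames, features)
block = TimeModelingBlock(mode="parallel")
print(block(feats).shape)                # torch.Size([2, 100, 256])
```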

Key words: Speech enhancement, Generative adversarial network, Gated recurrent unit, Self-attention mechanism, Feature fusion
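
For readers who want to reproduce the objective scores named in the abstract, the sketch below shows one conventional way to compute them in Python. It assumes the third-party pesq and pystoi packages and 16 kHz mono NumPy signals; the segmental SNR routine follows the common definition (frame-wise SNR averaged over frames, clamped to [-10, 35] dB) rather than any script released with the paper.

```python
# Hedged sketch of the evaluation metrics mentioned in the abstract (PESQ,
# SSNR, STOI), assuming the third-party `pesq` and `pystoi` packages and
# 16 kHz mono NumPy arrays. Not the paper's evaluation script.
import numpy as np
from pesq import pesq    # ITU-T P.862 PESQ (wideband mode for 16 kHz)
from pystoi import stoi  # short-time objective intelligibility

def segmental_snr(clean, enhanced, frame=256, lo=-10.0, hi=35.0):
    """Mean frame-wise SNR in dB, clamped to [lo, hi] as is conventional."""
    n = min(len(clean), len(enhanced)) // frame * frame
    c = clean[:n].reshape(-1, frame)
    e = enhanced[:n].reshape(-1, frame)
    noise = c - e
    snr = 10 * np.log10(np.sum(c**2, axis=1)
                        / (np.sum(noise**2, axis=1) + 1e-10) + 1e-10)
    return float(np.mean(np.clip(snr, lo, hi)))

def evaluate(clean, enhanced, fs=16000):
    """Score an enhanced signal against its clean reference."""
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),
        "SSNR": segmental_snr(clean, enhanced),
        "STOI": stoi(clean, enhanced, fs, extended=False),
    }
```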

CLC Number:

  • TP391