Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230200203-9. DOI: 10.11896/jsjkx.230200203
ZHANG Dehui1, DONG Anming1,2, YU Jiguo1,2, ZHAO Kai3 and ZHOU You4
Abstract: Generative Adversarial Networks (GANs), which continually improve their mapping ability through adversarial training between two networks, possess strong denoising capability and have been applied to speech enhancement in recent years. To address the shortcoming that existing GAN-based speech enhancement methods do not fully exploit the temporal and global correlations within speech feature sequences, a speech enhancement GAN that integrates Gated Recurrent Units (GRU) and a self-attention mechanism is proposed. The network builds temporal modeling modules in both serial and parallel configurations, capturing the temporal correlations and contextual information of the speech feature sequence. Compared with the baseline algorithm, the proposed GAN improves the Perceptual Evaluation of Speech Quality (PESQ) score by 4% and performs better on several other objective metrics, including segmental signal-to-noise ratio (SSNR) and short-time objective intelligibility (STOI). These results indicate that fusing the temporal and global correlations of speech feature sequences helps improve the speech enhancement performance of GANs.
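The two temporal modeling configurations described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the weights are random stand-ins rather than learned parameters, and the dimensions are arbitrary. It only shows how a GRU (frame-by-frame recurrence for local temporal correlations) and scaled dot-product self-attention (all-pairs frame mixing for global correlations) can be combined in series or in parallel over a speech feature sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def gru(X, d):
    """Minimal GRU over a (T, d_in) sequence; weights are random stand-ins."""
    d_in = X.shape[1]
    Wz, Wr, Wh = (rng.standard_normal((d_in + d, d)) / np.sqrt(d_in + d)
                  for _ in range(3))
    h = np.zeros(d)
    out = []
    for x in X:
        xh = np.concatenate([x, h])
        z = 1 / (1 + np.exp(-(xh @ Wz)))                  # update gate
        r = 1 / (1 + np.exp(-(xh @ Wr)))                  # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)
        h = (1 - z) * h + z * h_tilde                     # recurrent state carries local context
        out.append(h)
    return np.stack(out)                                  # (T, d)

def self_attention(X):
    """Scaled dot-product self-attention over all frames (global correlations)."""
    d = X.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                         # (T, T) frame-to-frame affinities
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                     # softmax over all frames
    return w @ V                                          # each frame mixes the whole utterance

T, d = 8, 16
X = rng.standard_normal((T, d))           # stand-in for a speech feature sequence
serial   = self_attention(gru(X, d))      # serial: GRU first, then attention on its output
parallel = gru(X, d) + self_attention(X)  # parallel: both branches on X, outputs summed
print(serial.shape, parallel.shape)       # both (8, 16)
```

The serial form lets attention operate on recurrence-smoothed features, while the parallel form keeps the two views independent and fuses them afterward; the paper evaluates both arrangements inside the GAN generator.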