Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230200203-9.doi: 10.11896/jsjkx.230200203

• Image Processing & Multimedia Technology •

Speech Enhancement Based on Generative Adversarial Networks with Gated Recurrent Units and Self-attention Mechanisms

ZHANG Dehui1, DONG Anming1,2, YU Jiguo1,2, ZHAO Kai3 and ZHOU You4

  1 School of Computer Science and Technology,Qilu University of Technology(Shandong Academy of Sciences),Jinan 250353,China
    2 Big Data Research Institute,Qilu University of Technology(Shandong Academy of Sciences),Jinan 250353,China
    3 Institute of Automation,Chinese Academy of Sciences,Beijing 100190,China
    4 Shandong HiCon New Media Institute Co.,Ltd.,Jinan 250013,China
  • Published:2023-11-09
  • About author:ZHANG Dehui,born in 1997,master,is a student member of China Computer Federation.His main research interests include deep learning and speech enhancement.
    DONG Anming,born in 1982,Ph.D,associate professor,postgraduate supervisor,is a member of China Computer Federation.His main research interests include time series signal processing,wireless communication and artificial intelligence.
  • Supported by:
    National Key Research and Development Program of China(2019YFB2102600),National Natural Science Foundation of China(62272256),Innovation Capability Enhancement Program for Small and Medium-sized Technological Enterprises of Shandong Province(2022TSGC2180,2022TSGC2123),Piloting Fundamental Research Program for the Integration of Scientific Research,Independent Training Innovation Team of Jinan(202228093),and Piloting Fundamental Research Program for the Integration of Scientific Research,Education and Industry of Qilu University of Technology(Shandong Academy of Sciences)(2022XD001).

Abstract: Generative adversarial networks (GANs) train two networks against each other, continually improving the generator's mapping ability, and have therefore shown strong noise-reduction performance when applied to speech enhancement in recent years. Existing GAN-based speech enhancement methods, however, do not make full use of the temporal and global dependencies in speech feature sequences. To address this shortcoming, this paper proposes a speech enhancement GAN that integrates gated recurrent units and a self-attention mechanism. The network combines serial and parallel temporal modeling modules to capture both the temporal dependence and the contextual information of speech feature sequences. Compared with the baseline algorithm, the proposed network improves the perceptual evaluation of speech quality (PESQ) score by 4% and also performs better on several other objective metrics, including segmental signal-to-noise ratio (SSNR) and short-time objective intelligibility (STOI). The results show that fusing the temporal and global correlations of speech feature sequences helps improve the speech enhancement performance of GANs.
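The parallel temporal modeling idea described in the abstract can be illustrated with a minimal NumPy sketch. All names, weight shapes, and the fusion-by-concatenation choice below are illustrative assumptions (`temporal_module`, `d_h`, etc. are hypothetical), not the paper's implementation: a GRU branch captures local temporal dependence while a self-attention branch captures global context, and the two outputs are fused along the feature axis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a (T, d) feature sequence:
    every frame attends to every other frame (global dependencies)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) pairwise similarities
    return softmax(scores) @ V                # (T, d) context-weighted features

def gru(X, params, d_h):
    """Plain GRU recurrence over the sequence (local temporal dependence)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h, out = np.zeros(d_h), []
    for x in X:
        z = sigmoid(x @ Wz + h @ Uz)              # update gate
        r = sigmoid(x @ Wr + h @ Ur)              # reset gate
        h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
        h = (1 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)                      # (T, d_h)

def temporal_module(X, gru_params, att_params, d_h):
    """Parallel fusion: run the GRU branch and the attention branch on the
    same features and concatenate the results along the feature axis."""
    g = gru(X, gru_params, d_h)               # temporal branch
    a = self_attention(X, *att_params)        # global-context branch
    return np.concatenate([g, a], axis=-1)    # (T, d_h + d)
```

A serial variant, also mentioned in the abstract, would instead feed one branch's output into the other (e.g. attention applied to the GRU's hidden-state sequence) rather than concatenating parallel outputs.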

Key words: Speech enhancement, Generative adversarial network, Gated recurrent unit, Self-attention mechanism, Feature fusion

CLC Number: TP391
[1]LAN T,PENG C,LI S,et al.Review of monophonic speech noise reduction and dereverberation research [J].Computer Research and Development,2020,57(5):26.
[2]XIANG Q,TANG Y.Research on Chinese Speech Enhancement Technology Based on Generative Adversarial Networks [J].Computer Application Research,2020(S02):150-151.
[3]LOIZOU P C.Speech enhancement:theory and practice[M].CRC Press,2007.
[4]WANG H,LI J,ZHAO H M,et al.Speech enhancement algorithm based on sparse low-rank model and phase spectrum compensation [J].Computer Engineering and Applications,2018,54(5):6.
[5]BOLL S.Suppression of acoustic noise in speech using spectral subtraction[J].IEEE Transactions on Acoustics,Speech,and Signal Processing,1979,27(2):113-120.
[6]LIM J S,OPPENHEIM A V.Enhancement and bandwidth compression of noisy speech[J].Proceedings of the IEEE,1979,67(12):1586-1604.
[7]MCAULAY R,MALPASS M.Speech enhancement using a soft-decision noise suppression filter[J].IEEE Transactions on Acoustics,Speech,and Signal Processing,1980,28(2):137-145.
[8]LEE D D,SEUNG H S.Learning the parts of objects by non-negative matrix factorization[J].Nature,1999,401(6755):788-791.
[9]TAHA T M F,ADEEL A,HUSSAIN A.A survey on techniques for enhancing speech[J].International Journal of Computer Applications,2018,179(17):1-14.
[10]WANG Y,WANG D L.Towards scaling up classification-based speech separation[J].IEEE Transactions on Audio,Speech,and Language Processing,2013,21(7):1381-1390.
[11]FU S W,TSAO Y,LU X.SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement[C]//Interspeech.2016:3768-3772.
[12]TAN K,CHEN J,WANG D L.Gated residual networks with dilated convolutions for monaural speech enhancement[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2018,27(1):189-198.
[13]HUANG P S,KIM M,HASEGAWA-JOHNSON M,et al.Joint optimization of masks and deep recurrent neural networks for monaural source separation[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2015,23(12):2136-2147.
[14]XIAO C X,CHEN Y.Real-time speech enhancement algorithm based on recurrent neural network [J].Computer Engineering and Design,2021,42(7):6.
[15]WANG Z,ZHANG T,SHAO Y,et al.LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement[J].Applied Acoustics,2021,172:107647.
[16]BAO C C,XIANG Y.Review of single-channel speech enhancement methods based on deep neural network [J].Signal Processing,2019,35(12):11.
[17]XU Y,DU J,DAI L R,et al.A regression approach to speech enhancement based on deep neural networks[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2014,23(1):7-19.
[18]GAO G,YIN W B,CHEN Y,et al.A Speech Enhancement Method Based on Generative Adversarial Networks in Time-Frequency Domain [J].Computer Science,2022,49(6):6.
[19]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial networks[J].Communications of the ACM,2020,63(11):139-144.
[20]PASCUAL S,BONAFONTE A,SERRA J.SEGAN:Speech enhancement generative adversarial network[J].arXiv:1703.09452,2017.
[21]PHAN H,MCLOUGHLIN I V,PHAM L,et al.Improving GANs for speech enhancement[J].IEEE Signal Processing Letters,2020,27:1700-1704.
[22]PHAN H,LE NGUYEN H,CHÉN O Y,et al.Self-attention generative adversarial network for speech enhancement[C]//ICASSP 2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2021:7103-7107.
[23]DONAHUE C,LI B,PRABHAVALKAR R.Exploring speech enhancement with generative adversarial networks for robust speech recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2018:5024-5028.
[24]LI P,JIANG Z,YIN S,et al.Pagan:A phase-adapted generative adversarial networks for speech enhancement[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2020:6234-6238.
[25]HE K,ZHANG X,REN S,et al.Delving deep into rectifiers:Surpassing human-level performance on imagenet classification[C]//Proceedings of the IEEE International Conference on Computer Vision.2015:1026-1034.
[26]TONG T,LI G,LIU X,et al.Image super-resolution usingdense skip connections[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:4799-4807.
[27]CHO K,VAN MERRIËNBOER B,GULCEHRE C,et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation[J].arXiv:1406.1078,2014.
[28]MNIH V,HEESS N,GRAVES A.Recurrent models of visual attention[J].Advances in Neural Information Processing Systems,2014,27.
[29]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].Advances in Neural Information Processing Systems,2017,30.
[30]LIM J,OPPENHEIM A.All-pole modeling of degraded speech[J].IEEE Transactions on Acoustics,Speech,and Signal Processing,1978,26(3):197-210.
[31]VALENTINI-BOTINHAO C,WANG X,TAKAKI S,et al.Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech[C]//SSW.2016:146-152.
[32]THIEMANN J,ITO N,VINCENT E.The diverse environments multichannel acoustic noise database(DEMAND):A database of multichannel environmental noise recordings[C]//Proceedings of Meetings on Acoustics ICA2013.Acoustical Society of America,2013.
[33]INTERNATIONAL TELECOMMUNICATION UNION.Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs[S].ITU-T Recommendation P.862.2,2007.
[34]HU Y,LOIZOU P C.Evaluation of objective quality measures for speech enhancement[J].IEEE Transactions on Audio,Speech,and Language Processing,2007,16(1):229-238.
[35]TAAL C H,HENDRIKS R C,HEUSDENS R,et al.A short-time objective intelligibility measure for time-frequency weighted noisy speech[C]//2010 IEEE International Conference on Acoustics,Speech and Signal Processing.IEEE,2010:4214-4217.