Computer Science, 2024, Vol. 51, Issue 9: 338-345. doi: 10.11896/jsjkx.230700200
YAO Yao, YANG Jibin, ZHANG Xiongwei, LI Yihao, SONG Gongkunkun
Abstract: To eliminate the adverse effect of environmental noise and channel noise on speech communication quality in radio systems and to improve the quality of radio speech communication, a depthwise separable U-shaped network, CLU-Net (Channel Attention and LSTM-based U-Net), which combines channel attention with a long short-term memory (LSTM) network, is proposed. The network uses depthwise separable convolutions for low-complexity feature extraction, and jointly exploits the attention mechanism and the LSTM to attend to speech channel features and long-range temporal context at the same time, so that clean speech features are emphasized with a small number of parameters. Multiple comparative experiments are conducted on public and field-recorded datasets. Simulation results show that the proposed method outperforms comparable speech enhancement models on metrics such as PESQ and STOI on the VoiceBank-DEMAND dataset. Field experiments further show that the proposed CLU-Net enhancement framework effectively suppresses environmental and channel noise, and that its enhancement performance under low signal-to-noise-ratio conditions is superior to that of other enhancement networks of the same type.
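As a rough illustration of the architecture outlined in the abstract (a U-shaped network built from depthwise separable convolutions, with channel attention and an LSTM capturing channel-wise features and long-range temporal context), the following is a minimal PyTorch sketch. It is assembled from the abstract alone: the class names (DepthwiseSeparableConv1d, ChannelAttention, CLUNetSketch), the SE-style squeeze-and-excitation attention, the layer sizes, and the residual/skip wiring are illustrative assumptions, not the authors' actual CLU-Net implementation.

# Minimal sketch of a CLU-Net-style model. Assumptions: layer sizes, SE-style
# channel attention, a single LSTM bottleneck, and waveform-domain input/output.
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise conv followed by a pointwise (1x1) conv: low-cost feature extraction."""
    def __init__(self, in_ch, out_ch, kernel_size=5, stride=2):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class ChannelAttention(nn.Module):
    """SE-style channel attention: global average pooling + two FC layers + sigmoid gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        w = self.fc(x.mean(dim=-1))            # squeeze over time -> (batch, channels)
        return x * w.unsqueeze(-1)             # re-weight channels

class CLUNetSketch(nn.Module):
    """Toy encoder -> channel attention + LSTM bottleneck -> decoder, on raw waveforms."""
    def __init__(self, hidden=64):
        super().__init__()
        self.enc1 = DepthwiseSeparableConv1d(1, hidden)
        self.enc2 = DepthwiseSeparableConv1d(hidden, hidden)
        self.att = ChannelAttention(hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.dec2 = nn.ConvTranspose1d(hidden, hidden, kernel_size=5, stride=2,
                                       padding=2, output_padding=1)
        self.dec1 = nn.ConvTranspose1d(hidden, 1, kernel_size=5, stride=2,
                                       padding=2, output_padding=1)

    def forward(self, wav):                    # wav: (batch, 1, samples)
        e1 = self.enc1(wav)
        e2 = self.enc2(e1)
        a = self.att(e2)                       # channel attention on encoder features
        h, _ = self.lstm(a.transpose(1, 2))    # LSTM over time: (batch, time, channels)
        b = h.transpose(1, 2) + a              # residual connection (assumption)
        d2 = torch.relu(self.dec2(b) + e1)     # U-Net style skip connection
        return self.dec1(d2)                   # enhanced waveform estimate

if __name__ == "__main__":
    noisy = torch.randn(2, 1, 16000)           # two 1-second clips at 16 kHz
    enhanced = CLUNetSketch()(noisy)
    print(enhanced.shape)                      # expected: torch.Size([2, 1, 16000])

The depthwise/pointwise split is what keeps the parameter count low: for C input and output channels and kernel size k, a standard Conv1d layer needs roughly C*C*k weights, whereas the separable version needs only about C*k + C*C.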