Computer Science ›› 2024, Vol. 51 ›› Issue (9): 338-345. doi: 10.11896/jsjkx.230700200

• Computer Network •

CLU-Net Speech Enhancement Network for Radio Communication

YAO Yao, YANG Jibin, ZHANG Xiongwei, LI Yihao, SONG Gongkunkun

  1. Army Engineering University of PLA,Nanjing 210007,China
  • Received:2023-07-26 Revised:2023-11-04 Online:2024-09-15 Published:2024-09-10
  • Corresponding author: YANG Jibin(yjbice@sina.com)
  • About author:(speech_11@163.com)
  • Supported by:
    National Natural Science Foundation of China(62071484) and Basic Frontier Project of Army Engineering University of PLA(KYZYJKQTZQ23001).

CLU-Net Speech Enhancement Network for Radio Communication

YAO Yao, YANG Jibin, ZHANG Xiongwei, LI Yihao, SONG Gongkunkun   

  1. School of Command and Control Engineering,Army Engineering University of PLA,Nanjing 210007,China
  • Received:2023-07-26 Revised:2023-11-04 Online:2024-09-15 Published:2024-09-10
  • About author:YAO Yao,born in 1998,postgraduate.Her main research interest is intelligent speech processing.
    YANG Jibin,born in 1978,Ph.D,associate professor.His main research interests include speech and acoustic signal processing.
  • Supported by:
    National Natural Science Foundation of China(62071484) and Basic Frontier Project of Army Engineering University of PLA(KYZYJKQTZQ23001).

Abstract: To eliminate the adverse effects of environmental noise and channel noise on speech communication quality in radio systems and to improve the quality of radio speech communication, this paper proposes CLU-Net (Channel Attention and LSTM-based U-Net), a depthwise separable U-shaped network that jointly exploits channel attention and long short-term memory (LSTM). The network uses depthwise separable convolutions for low-complexity feature extraction, and combines the attention mechanism with LSTM to attend simultaneously to channel-wise speech features and long-term contextual dependencies, so that clean speech features can be captured with relatively few parameters. Multiple comparative experiments are conducted on public and field-recorded datasets. Simulation results show that the proposed method outperforms comparable speech enhancement models on the VoiceBank-DEMAND dataset in terms of PESQ, STOI, and other metrics. Field experiments further show that the proposed CLU-Net enhancement framework effectively suppresses environmental and channel noise, and outperforms other enhancement networks of the same type under low signal-to-noise ratios.

Key words: Radio communication, Speech enhancement, Depthwise separable convolution, Attention mechanism

Abstract: To overcome the adverse effects of environmental and channel noise on speech communication quality in radio systems and to improve radio speech quality, this paper proposes a depthwise separable network called CLU-Net (Channel Attention and LSTM-based U-Net), which adopts a deep U-shaped architecture with long short-term memory (LSTM). In the network, depthwise separable convolution implements low-complexity feature encoding, while the combination of attention mechanisms and LSTM attends simultaneously to the relationships among convolution channels and the long-term context of speech, capturing clean speech characteristics with fewer parameters. A variety of noisy speech datasets are tested, including public corpora and self-built sets whose noise was collected in different environments and radio systems. Simulation results on the VoiceBank-DEMAND dataset indicate that the proposed method outperforms similar speech enhancement models in objective metrics such as PESQ and STOI. Field experiments show that the enhancement scheme effectively suppresses different types of environmental and radio noise, and its performance under low signal-to-noise ratios is superior to that of enhancement networks of the same kind.
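To make the ingredients above concrete, below is a minimal PyTorch sketch of one encoder block pairing a depthwise separable convolution[8,16] with squeeze-and-excitation-style channel attention[22]. All layer sizes, kernel widths, and the exact attention form are illustrative assumptions, not the authors' published implementation.

    # Minimal sketch of one CLU-Net-style encoder block (PyTorch).
    # All sizes are hypothetical; this is not the authors' released code.
    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv1d(nn.Module):
        """Depthwise conv over time followed by a 1x1 pointwise conv."""
        def __init__(self, in_ch, out_ch, kernel_size=5, stride=2):
            super().__init__()
            self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size, stride=stride,
                                       padding=kernel_size // 2, groups=in_ch)
            self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    class ChannelAttention(nn.Module):
        """Squeeze-and-excitation-style reweighting of convolution channels."""
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):                 # x: (batch, channels, time)
            w = self.fc(x.mean(dim=-1))       # squeeze time axis, excite channels
            return x * w.unsqueeze(-1)        # per-channel gain in [0, 1]

    class EncoderBlock(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = DepthwiseSeparableConv1d(in_ch, out_ch)
            self.norm = nn.BatchNorm1d(out_ch)
            self.act = nn.PReLU()
            self.attn = ChannelAttention(out_ch)

        def forward(self, x):
            return self.attn(self.act(self.norm(self.conv(x))))

    x = torch.randn(1, 1, 16000)              # 1 s of 16 kHz waveform
    h = EncoderBlock(1, 16)(x)
    print(h.shape)                            # torch.Size([1, 16, 8000])

A full U-shaped model of this kind would stack several such blocks, run an LSTM over the bottleneck features to capture long-term context, and mirror the encoder with transposed convolutions plus skip connections on the decoder side.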

Key words: Radio communication, Speech enhancement, Depthwise separable convolution, Attention mechanism
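For readers reproducing the evaluation, PESQ[28] and STOI[29] scores can be computed with the third-party pesq and pystoi Python packages. The snippet below is a generic scoring sketch with placeholder file names, not necessarily the paper's exact protocol.

    # Hedged scoring example using the open-source pesq and pystoi packages
    # (pip install pesq pystoi soundfile); file names are placeholders.
    import soundfile as sf
    from pesq import pesq
    from pystoi import stoi

    fs = 16000
    clean, _ = sf.read("clean.wav")        # reference utterance, 16 kHz mono
    enhanced, _ = sf.read("enhanced.wav")  # network output

    n = min(len(clean), len(enhanced))     # align lengths before scoring
    clean, enhanced = clean[:n], enhanced[:n]

    print("PESQ (wb):", pesq(fs, clean, enhanced, "wb"))       # range -0.5..4.5
    print("STOI:", stoi(clean, enhanced, fs, extended=False))  # range 0..1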

CLC Number: 

  • TP391
[1]WANG Y P,WEI G H,PAN X D,et al.Prediction model and experiment of out-of-band dual-band interference of communication station[J].Acta Electronica Sinica,2019,47(4):826-831.
[2]LI S,CAO F.Research on end-to-end framework model analysis and trend of intelligent speech technology[J].Computer Science,2022,49(S1):331-336.
[3]PASCUAL S,BONAFONTE A,SERRA J.SEGAN:Speech Enhancement Generative Adversarial Network[C]//Conference of the International Speech Communication Association.2017:3642-3646.
[4]PANDEY A,WANG D.TCNN:Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain[C]//2019 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2019).Brighton,UK,2019:6875-6879.
[5]PANDEY A,WANG D L.Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:6629-6633.
[6]FAN J Y,YANG J B,ZHANG X W,et al.Single-channel speech enhancement based on multi-head attention mechanism in U-net network[J].Acta Acoustica Sinica,2022,47(6):703-716.
[7]LI L,ZHU Y,ZHU Z.Automatic Modulation Classification Using ResNeXt-GRU With Deep Feature Fusion[J].IEEE Transactions on Instrumentation and Measurement,2023,72:1-10.
[8]CHOLLET F.Xception:Deep learning with depthwise separable convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:1251-1258.
[9]BENGIO Y,SIMARD P,FRASCONI P.Learning long-term dependencies with gradient descent is difficult[J].IEEE Transactions on Neural Networks,1994,5(2):157-166.
[10]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[11]BANG J Y,SUN M,ZHANG X W,et al.Lightweight Model for Bone-Conducted Speech Enhancement Based on Convolution Network and Residual Long Short-Time Memory Network[J].Journal of Data Acquisition & Processing,2021,36(5):921-931.
[12]ZHANG Q,SONG Q,NI Z,et al.Time-frequency attention for monaural speech enhancement[C]//2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2022).IEEE,2022:7852-7856.
[13]WOO S,PARK J,LEE J Y,et al.CBAM:Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision(ECCV).2018:3-19.
[14]TOLOOSHAMS B,GIRI R,SONG A H,et al.Channel-attention dense U-Net for multichannel speech enhancement[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).Barcelona,Spain.IEEE,2020:836-840.
[15]ZHU X,LI J,LIU Y,et al.A Survey on Model Compression for Large Language Models[J].arXiv:2308.07633,2023.
[16]HOWARD A G,ZHU M,CHEN B,et al.MobileNets:Efficient convolutional neural networks for mobile vision applications[J].arXiv:1704.04861,2017.
[17]ZHANG X,ZHOU X,LIN M,et al.ShuffleNet:An extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:6848-6856.
[18]ZENG Y,LI Y,ZHOU Z,et al.Domestic activities classification from audio recordings using multi-scale dilated depthwise separable convolutional network[C]//2021 IEEE 23rd International Workshop on Multimedia Signal Processing(MMSP).IEEE,2021:1-5.
[19]TAN K,WANG D L.A convolutional recurrent neural network for real-time speech enhancement[C]//Interspeech 2018.2018:3229-3233.
[20]LE X,CHEN H,CHEN K,et al.DPCRN:Dual-path convolution recurrent network for single channel speech enhancement[C]//Interspeech 2021,22nd Annual Conference of the International Speech Communication Association.Brno,Czechia,2021:2811-2815.
[21]DEFOSSEZ A,SYNNAEVE G,ADI Y.Real time speech enhancement in the waveform domain[C]//Interspeech 2020,21st Annual Conference of the International Speech Communication Association,Virtual Event.2020:3291-3295.
[22]HU J,SHEN L,SUN G.Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7132-7141.
[23]FU J,LIU J,TIAN H,et al.Dual attention network for scene segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:3146-3154.
[24]PARK H J,KANG B H,SHIN W,et al.MANNER:Multi-view attention network for noise erasure[C]//2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2022).Singapore,IEEE,2022:7842-7846.
[25]LI Y,WANG W,CHEN H,et al.Few-shot speaker identification using depthwise separable convolutional network with channel attention[J].arXiv:2204.11180,2022.
[26]VALENTINI-BOTINHAO C,WANG X,TAKAKI S,et al.Investigating RNN-based speech enhancement methods for noise-robust text-to-speech[C]//SSW.2016:146-152.
[27]WANG D,ZHANG X.THCHS-30:A free Chinese speech corpus[J].arXiv:1512.01882,2015.
[28]RIX A W,BEERENDS J G,HOLLIER M P,et al.Perceptual evaluation of speech quality(PESQ)-a new method for speech quality assessment of telephone networks and codecs[C]//Proceedings of the 26th International Conference on Acoustics,Speech,and Signal Processing.Utah:IEEE,2001:749-752.
[29]TAAL C H,HENDRIKS R C,HEUSDENS R,et al.An algorithm for intelligibility prediction of time-frequency weighted noisy speech[J].IEEE Transactions on Audio,Speech,and Language Processing,2011,19(7):2125-2136.
[30]HU Y,LOIZOU P C.Evaluation of objective quality measuresfor speech enhancement[J].IEEE Transactions on Audio,Speech,and Language Processing,2007,16(1):229-238.
[31]MACARTNEY C,WEYDE T.Improved speech enhancement with the Wave-U-Net[J].arXiv:1811.11307,2018.
[32]FU S W,LIAO C F,TSAO Y,et al.MetricGAN:Generative adversarial networks based black-box metric scores optimization for speech enhancement[C]//International Conference on Machine Learning.PMLR,2019:2031-2041.
[33]YIN D,LUO C,XIONG Z,et al.PHASEN:A phase-and-harmonics-aware speech enhancement network[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020,34(5):9458-9465.
[34]ZHANG Q Q,NICOLSON A,WANG M J,et al.DeepMMSE:A deep learning approach to MMSE-based noise power spectral density estimation[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2020,28:1404-1415.
[35]WANG K,HE B,ZHU W P.TSTNN:Two-Stage Transformer Based Neural Network for Speech Enhancement in the Time Domain[C]//2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2021).Toronto,ON,Canada,2021:7098-7102.
[36]KONG Z,PING W,DANTREY A,et al.Speech denoising in the waveform domain with self-attention[C]//2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2022).IEEE,2022:7867-7871.