Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 220900135-7. doi: 10.11896/jsjkx.220900135

• Interdiscipline & Application •


Sound Source Arrival Direction Estimation Based on GRU and Self-attentive Network

HE Ruhan1,2, CHEN Yifan1,2, YU Yongsheng3 and JIANG Aisen4

  1 Hubei Provincial Engineering Research Center for Intelligent Textile and Fashion,Wuhan 430200,China
    2 School of Computer Science and Artificial Intelligence,Wuhan Textile University,Wuhan 430200,China
    3 State Key Laboratory of Silicate Materials for Architectures,Wuhan University of Technology,Wuhan 430070,China
    4 Science and Technology Institute,Wuhan Textile University,Wuhan 430200,China
  • Published:2023-11-09
  • Corresponding author:CHEN Yifan(2015363091@mail.wtu.edu.cn)
  • About author:HE Ruhan,born in 1974,Ph.D,professor(heruhan@wtu.edu.cn),is a member of China Computer Federation.His main research interests include machine learning,computer vision and multimedia retrieval.
    CHEN Yifan,born in 1999,postgraduate.His main research interests include machine learning and sound source localization.
  • Supported by:
    National Natural Science Foundation of China(61170093).


Abstract: Neural network-based sound source localization has received wide attention in recent years, but mitigating problems such as the loss of implicit DOA location information and small sample sizes remains challenging. This paper therefore proposes a sound source direction-of-arrival (DOA) estimation method based on GRU and a self-attention network. The method adopts GRU, which performs well on small datasets, as the backbone network, compensating for the difficulty of collecting clean sound data. The training set is formed from multichannel recordings of sound sources: short-time Fourier transform feature extraction yields Mel spectrograms and acoustic intensity vectors, and the input features are formed by stacking the multichannel spectrograms with the normalized principal feature vector. This avoids the corruption of implicit DOA information that occurs when spectrograms are combined with GCC-PHAT features, effectively mitigating the loss of implicit DOA location information. These features are fed into a convolutional recurrent neural network for supervised learning of the model parameters. The model output obtains DOA location estimates by regression in 3D Cartesian coordinates, and a self-attention network is added for parameter back-propagation during training, so that the network computes the loss and predicts the association matrix while training, solving the optimal assignment between predicted and reference locations. Experimental results show that the network achieves high localization accuracy and robustness under different reverberation conditions and signal-to-noise ratios.
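The feature pipeline the abstract describes (STFT, Mel-scale spectrograms, acoustic intensity vectors, channel stacking) can be sketched in numpy. This is a minimal illustration under the assumption of first-order-ambisonics (W, X, Y, Z) input; the window, hop, filterbank, and normalization choices here are illustrative, not the paper's exact configuration:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT: Hann-windowed frames of x (samples,). Returns (frames, bins)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def mel_filterbank(n_mels=64, n_fft=512, sr=24000):
    """Triangular Mel filterbank, shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def foa_features(audio, sr=24000, n_fft=512, hop=256, n_mels=64):
    """audio: (4, samples) first-order ambisonics (W, X, Y, Z).
    Returns (7, frames, n_mels): 4 log-Mel spectrograms stacked with
    3 normalized intensity-vector channels, as the model input."""
    specs = np.stack([stft(ch, n_fft, hop) for ch in audio])   # (4, T, F)
    fb = mel_filterbank(n_mels, n_fft, sr)
    logmel = np.log(np.abs(specs) ** 2 @ fb.T + 1e-8)          # (4, T, n_mels)
    # Acoustic intensity vector: I = Re(conj(W) * [X, Y, Z])
    intensity = np.real(np.conj(specs[0:1]) * specs[1:4])      # (3, T, F)
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + 1e-8
    intensity_mel = (intensity / norm) @ fb.T                  # (3, T, n_mels)
    return np.concatenate([logmel, intensity_mel], axis=0)
```

For one second of 4-channel audio at 24 kHz with these settings, `foa_features` returns a (7, 92, 64) tensor: 92 frames of 64 Mel bands across 4 spectrogram channels plus 3 intensity channels.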

Key words: Sound source direction of arrival estimation, GRU, Convolutional neural network, Recurrent neural network, Self-attention

CLC number: 

  • TP391
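To make the model side concrete, the following toy numpy forward pass illustrates the three ingredients the abstract names: GRU recurrence over frame features, self-attention over time, and regression to a 3D Cartesian DOA unit vector. All dimensions and weights here are illustrative assumptions; the paper's actual model is a convolutional recurrent network trained with an optimal assignment between predicted and reference locations, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU cell update (biases omitted for brevity)."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)                  # update gate
    r = sig(x @ Wr + h @ Ur)                  # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde

def self_attention(H):
    """Scaled dot-product self-attention over the frame axis. H: (T, d)."""
    d = H.shape[-1]
    scores = H @ H.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ H

def doa_head(H, Wo):
    """Regress a 3D Cartesian DOA unit vector from attended frame features."""
    pooled = self_attention(H).mean(axis=0)
    v = pooled @ Wo                           # (3,)
    return v / (np.linalg.norm(v) + 1e-8)

# Toy forward pass: T frames of d_in-dim features through the GRU, then the head.
T, d_in, d_h = 92, 64, 32
X = rng.standard_normal((T, d_in))
Ws = [rng.standard_normal(s) * 0.1
      for s in [(d_in, d_h), (d_h, d_h)] * 3]  # Wz, Uz, Wr, Ur, Wh, Uh
h = np.zeros(d_h)
Hs = []
for t in range(T):
    h = gru_step(X[t], h, *Ws)
    Hs.append(h)
doa = doa_head(np.stack(Hs), rng.standard_normal((d_h, 3)) * 0.1)
```

Normalizing the regression output to a unit vector keeps the estimate on the unit sphere, so the azimuth and elevation can be read off directly from the Cartesian components.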