Computer Science ›› 2023, Vol. 50 ›› Issue (3): 191-198. doi: 10.11896/jsjkx.220500259

• Computer Graphics & Multimedia •

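A Sound Event Joint Estimation Method Based on Three-dimensional Convolution

MEI Pengcheng, YANG Jibin, ZHANG Qiang, HUANG Xiang

  1. School of Command and Control Engineering, Army Engineering University, Nanjing 210007
  • Received: 2022-05-30 Revised: 2022-09-13 Published: 2023-03-15 Online: 2023-03-15
  • Corresponding author: YANG Jibin (yjbice@sina.com)
  • About the author: (765388155@qq.com)
  • Supported by:
    National Natural Science Foundation of China (62071484)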

Sound Event Joint Estimation Method Based on Three-dimension Convolution

MEI Pengcheng, YANG Jibin, ZHANG Qiang, HUANG Xiang   

  1. School of Command and Control Engineering,Army Engineering University,Nanjing 210007,China
  • Received:2022-05-30 Revised:2022-09-13 Online:2023-03-15 Published:2023-03-15
  • About author: MEI Pengcheng, born in 1993, postgraduate. His main research interests include machine learning and acoustic signal processing.
    YANG Jibin, born in 1978, Ph.D., associate professor. His main research interests include speech and acoustic signal processing, machine learning and pattern recognition.
  • Supported by:
    National Natural Science Foundation of China(62071484).

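Abstract (Chinese version): Sound event localization and detection is widely used in tasks such as monitoring and anomaly detection. Deep learning methods, represented by the convolutional recurrent neural network architecture, can jointly perform sound event detection and sound source localization. To improve overall localization and detection performance, a sound event joint estimation method based on three-dimensional (3D) convolution, SELD3Dnet, is proposed. Magnitude and phase features are computed from the input multi-channel audio, and high-level feature representations are extracted by multiple 3D convolution structures. Finally, a recurrent network and fully connected layers estimate the classes and spatial locations of the sound events. When processing multi-channel acoustic signal features, 3D convolution convolves simultaneously over the time, frequency and signal-channel dimensions. This makes the fullest use of the correlation between signal channels and overcomes the effects of noise and reverberation. Extensive comparative experiments were conducted on public datasets including TUT2018 and TAU2019. The results show that the proposed method significantly improves overall performance on the TUT2018 REAL and TAU2019 MREAL datasets. On TUT2018 REAL, the F1 score of sound event detection improves by 13.9% and frame accuracy by 21.1%. On TAU2019 MREAL, the F1 score improves by 10.8% and frame accuracy by 14.4%. This shows that the proposed method can effectively overcome the effects of reverberation in real signals.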
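The feature pipeline described above (per-channel magnitude and phase, then 3D convolution over the signal-channel, time and frequency axes) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' SELD3Dnet. The following are all assumed:

- the Hann window;
- the frame length and hop size;
- the layout that stacks the M magnitude maps followed by the M phase maps along one channel axis.

`stft_mag_phase`, `build_features` and `conv3d_valid` are hypothetical helper names. As in common deep learning frameworks, the convolution is computed as an unpadded cross-correlation.

```python
import cmath
import math


def stft_mag_phase(signal, frame_len, hop):
    """Hann-windowed DFT of one channel; returns magnitude and phase as [time][freq] lists."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    mags, phases = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len))
            for k in range(n_bins)
        ]
        mags.append([abs(c) for c in spectrum])
        phases.append([cmath.phase(c) for c in spectrum])
    return mags, phases


def build_features(channels, frame_len=16, hop=8):
    """Stack M magnitude maps, then M phase maps, into a [2M][time][freq] volume (assumed layout)."""
    mag_maps, phase_maps = [], []
    for x in channels:
        mag, phase = stft_mag_phase(x, frame_len, hop)
        mag_maps.append(mag)
        phase_maps.append(phase)
    return mag_maps + phase_maps


def conv3d_valid(volume, kernel):
    """Unpadded 3D cross-correlation over the (channel, time, freq) axes with one shared kernel."""
    C, T, F = len(volume), len(volume[0]), len(volume[0][0])
    kc, kt, kf = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for c in range(C - kc + 1):
        plane = []
        for t in range(T - kt + 1):
            row = []
            for f in range(F - kf + 1):
                # Each output sums a kc x kt x kf block spanning several adjacent signal-channel maps.
                acc = 0.0
                for dc in range(kc):
                    for dt in range(kt):
                        for df in range(kf):
                            acc += volume[c + dc][t + dt][f + df] * kernel[dc][dt][df]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


if __name__ == "__main__":
    # Hypothetical 4-channel signal: the same sinusoid with a per-channel phase offset.
    channels = [[math.sin(0.3 * n + m) for n in range(64)] for m in range(4)]
    features = build_features(channels)  # 8 maps x 7 frames x 9 bins
    kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]  # 3x3x3 averaging kernel
    out = conv3d_valid(features, kernel)
    print(len(out), len(out[0]), len(out[0][0]))  # 6 5 7
```

A 2D convolution collapses all input maps into each output map, with a separate weight per map. The 3D kernel here instead slides along the channel axis with shared weights, so the output keeps a channel dimension: with a kernel depth of 3, 8 input maps give 6 outputs.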

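Keywords (Chinese version): sound event localization and detection, deep learning, convolutional neural network, three-dimensional convolution, multi-channel signal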

Abstract: Sound event localization and detection (SELD) is widely used in monitoring and anomaly detection tasks. Deep learning methods, represented by convolutional recurrent neural networks (CRNN), can perform sound event detection and sound source localization jointly. To improve overall localization and detection performance, a method based on three-dimensional (3D) convolutional feature extraction, called SELD3Dnet, is proposed. The amplitude and phase spectra of the input multi-channel acoustic signal are calculated, and deep feature representations are extracted by multiple 3D convolution modules. Recurrent neural networks and fully connected layers then estimate the classes of the sound events and their locations. When processing multi-channel acoustic signals, 3D convolution convolves simultaneously over the time, frequency and signal-channel dimensions, so that the correlation between signal channels can be exploited to the greatest extent. Comparative experiments are conducted on the TUT2018 and TAU2019 datasets. The results show that the overall performance of the proposed method is significantly improved on the TUT2018 REAL and TAU2019 MREAL datasets. On TUT2018 REAL, the F1 score of sound event detection improves by 13.9% and frame accuracy by 21.1%. On TAU2019 MREAL, the F1 score improves by 10.8% and frame accuracy by 14.4%. These results verify that the proposed method can effectively overcome the influence of reverberation in real-life scenes.
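The F1 gains reported above are measured on detection outputs. The sketch below is hedged: it assumes the common DCASE-style practice of scoring binary activity grids (time units × classes) and of pooling frames into fixed-length segments before counting. The function names, the OR-pooling rule and the micro-averaging are illustrative assumptions. The abstract specifies neither the segment length nor a definition of its frame accuracy metric, so frame accuracy is not modelled.

```python
def to_segments(activity, frames_per_segment):
    """OR-pool a [frames][classes] binary activity grid into fixed-length segments."""
    segments = []
    for start in range(0, len(activity), frames_per_segment):
        block = activity[start:start + frames_per_segment]
        # A class counts as active in a segment if it is active in any of its frames.
        segments.append([int(any(column)) for column in zip(*block)])
    return segments


def detection_f1(reference, estimate):
    """Micro-averaged F1 over all (time unit, class) cells of two binary activity grids."""
    tp = fp = fn = 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            if r and e:
                tp += 1  # event present and detected
            elif e:
                fp += 1  # detected but not present
            elif r:
                fn += 1  # present but missed
    denom = 2 * tp + fp + fn
    # F1 is undefined when both grids are empty; this sketch returns 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    reference = [[1, 0], [1, 1], [0, 1], [0, 0]]  # toy data: 4 frames x 2 classes
    estimate = [[1, 0], [0, 1], [0, 1], [1, 0]]
    print(detection_f1(reference, estimate))  # 0.75
    print(round(detection_f1(to_segments(reference, 2), to_segments(estimate, 2)), 3))  # 0.857
```

Pooling into segments makes the score more tolerant of small onset and offset misalignments. In the toy example, the frame-level F1 is 0.75, while the 2-frame segment F1 is about 0.857.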

Key words: Sound event localization and detection, Deep learning, Convolutional neural networks, Three-dimensional convolution, Multi-channel signal

CLC number: TP391.42