Computer Science ›› 2020, Vol. 47 ›› Issue (5): 120-123. doi: 10.11896/jsjkx.190900111

• Computer Graphics & Multimedia •

Sound Recognition and Detection Based on Multi-scale Attention Fusion in Weak Label Environment

ZHENG Wei-zhe1, QIU Peng2, WEI Juan2   

  1. School of Electronic Engineering, Xidian University, Xi'an 710071, China
  2. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China
  • Received: 2019-09-16 Online: 2020-05-15 Published: 2020-05-19
  • About author: ZHENG Wei-zhe, born in 1998, postgraduate. His main research interests include sound and image recognition.
    WEI Juan, born in 1973, Ph.D., associate professor. Her main research interests include sound localization and recognition.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (51675425) and the Key Research Program of Shaanxi Province, China (2018SF-365).

Abstract: Most current research on sound recognition and detection relies on datasets with strong labels. In real-world sound recognition and detection tasks, however, strongly labeled audio data is difficult to obtain because audio labels are often incomplete and accompanied by a large amount of noise, which in turn reduces the accuracy of sound recognition and detection. To address this, a multi-scale attention fusion mechanism is proposed on top of a convolutional recurrent neural network (CRNN) model. The mechanism uses an attention gating unit to make fuller use of the effective features while reducing the effect of noise in the time-frequency map features of the sound. At the same time, features are fused by combining convolution kernels of multiple sizes, which further improves the extraction of sound features. In addition, the sound signal is recognized by a weighting method that combines the frame-level detection results. Finally, a weakly labeled dataset containing 17 kinds of urban vehicle sounds is selected from the AudioSet database, and the proposed model is applied to it for detection and recognition in the weak label environment. On the test set, the F1 value is 58.9% for sound recognition and 43.7% for detection. The simulation experiments show that the CRNN baseline model used in this paper is more accurate than traditional sound recognition and detection models on the weakly labeled urban vehicle sound dataset, and that the methods proposed in the paper, such as the importance-weighted recognition method and the multi-scale attention fusion method, can improve the accuracy of the model for sound recognition and detection.
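The abstract does not give the formula behind the weighting method that combines frame-level detection results, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes that each frame has a detection score and a separate importance score for a given class, and that the clip-level probability is the importance-weighted average of the frame probabilities. The function name `attention_pool` and both input names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def attention_pool(frame_logits, attention_logits):
    """Combine frame-level detections into one clip-level probability.

    frame_logits: per-frame detection scores for one sound class (assumed input).
    attention_logits: per-frame importance scores for the same class (assumed input).
    Returns (clip_prob, frame_probs, weights).
    """
    if not frame_logits or len(frame_logits) != len(attention_logits):
        raise ValueError("inputs must be non-empty and of equal length")

    # Frame-level detection result: probability that the class is active.
    frame_probs = [sigmoid(z) for z in frame_logits]

    # Softmax over time turns importance scores into weights that sum to 1.
    # Subtracting the maximum keeps the exponentials numerically stable.
    peak = max(attention_logits)
    exps = [math.exp(a - peak) for a in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Clip-level recognition result: importance-weighted average of frames.
    clip_prob = sum(w * p for w, p in zip(weights, frame_probs))
    return clip_prob, frame_probs, weights


if __name__ == "__main__":
    # The second frame is both confident and important, so it dominates.
    clip, frames, weights = attention_pool([-2.0, 3.0, -1.0], [0.0, 2.0, 0.0])
    print(round(clip, 3), [round(w, 3) for w in weights])
```

Under these assumptions, the per-frame probabilities serve as the detection output and the pooled value as the recognition output, which is why a model trained only on clip-level (weak) labels can still produce frame-level predictions.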

Key words: Attention, Multi-scale, Sound detection, Sound recognition, Weak label

CLC Number: 

  • TP391