Sound Recognition and Detection Based on Multi-scale Attention Fusion in Weak Label Environment

doi:10.11896/jsjkx.190900111

Computer Science, 2020, Vol. 47, Issue (5): 120-123. doi: 10.11896/jsjkx.190900111

• Computer Graphics & Multimedia •


ZHENG Wei-zhe¹, QIU Peng², WEI Juan²

1 School of Electronic Engineering, Xidian University, Xi'an 710071, China
2 School of Telecommunications Engineering, Xidian University, Xi'an 710071, China

Received: 2019-09-16 Online: 2020-05-15 Published: 2020-05-19
Corresponding author: WEI Juan (weijuan@xidian.edu.cn)
Author e-mail: 634877973@qq.com
About author: ZHENG Wei-zhe, born in 1998, postgraduate. His main research interests include sound and image recognition.
WEI Juan, born in 1973, Ph.D., associate professor. Her main research interests include sound localization and recognition.
Supported by:
This work was supported by the National Natural Science Foundation of China (51675425) and the Key Research and Development Program of Shaanxi Province, China (2018SF-365).

Abstract

Abstract: Most current research on sound recognition and detection relies on strongly labeled datasets. In real-world tasks, however, audio labels are incomplete and contain a large amount of noise, so strongly labeled audio data is difficult to obtain, which in turn degrades the accuracy of sound recognition and detection. To address this, a multi-scale attention fusion mechanism is proposed on top of a convolutional recurrent neural network (CRNN) model. The mechanism uses attention gating units to reduce the influence of noise in the time-frequency features of the sound while making fuller use of the effective features. Feature fusion across convolution kernels of multiple sizes further improves the extraction of sound features. In addition, the sound signal is recognized with a weighting method that combines the frame-level detection results. Finally, in the weak-label setting, a weakly labeled dataset containing 17 kinds of urban vehicle sounds is selected from the AudioSet database for detection and recognition. On the test set, the proposed model achieves an F1 score of 58.9% for sound recognition and 43.7% for sound detection. The results show that, on the weakly labeled urban vehicle sound dataset, the CRNN baseline model used in this paper is more accurate than traditional sound recognition and detection models, and that both the importance-weighted recognition method and the multi-scale attention fusion method improve the accuracy of the model for sound recognition and detection.

Key words: Attention, Multi-scale, Sound detection, Sound recognition, Weak label
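The abstract names three ingredients: gated (attention) units, fusion across several convolution kernel sizes, and a clip-level decision formed by weighting frame-level detections. The sketch below illustrates these ideas in NumPy only; it is not the authors' network. The kernel sizes, the random placeholder weights, the array shapes, and the function names `gated_multiscale` and `weighted_clip_probability` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv_time(x, kernel):
    """Convolve every feature column of x (frames, bins) along the time axis.

    Uses 'same' mode so the number of frames is preserved
    (requires at least as many frames as kernel taps).
    """
    columns = [np.convolve(x[:, j], kernel, mode="same") for j in range(x.shape[1])]
    return np.stack(columns, axis=1)


def gated_multiscale(x, kernel_sizes=(3, 5, 7), seed=0):
    """Hypothetical gated multi-scale block.

    For each kernel size, a linear branch is multiplied element-wise by a
    sigmoid gate branch, and the per-scale outputs are concatenated along
    the feature axis. The weights are random placeholders, not trained values.
    """
    rng = np.random.default_rng(seed)
    fused = []
    for k in kernel_sizes:
        linear = conv_time(x, rng.normal(size=k))
        gate = sigmoid(conv_time(x, rng.normal(size=k)))  # values in (0, 1)
        fused.append(linear * gate)
    return np.concatenate(fused, axis=1)


def weighted_clip_probability(frame_probs, frame_weights):
    """Clip-level probability as a weighted average of frame-level probabilities.

    frame_probs and frame_weights have shape (frames, classes); the weights
    must be positive and are normalized over the time axis per class.
    """
    weights = frame_weights / frame_weights.sum(axis=0, keepdims=True)
    return (frame_probs * weights).sum(axis=0)
```

For example, calling `gated_multiscale` on a `(20, 8)` time-frequency patch returns a `(20, 24)` array (three scales of 8 bins each), and `weighted_clip_probability` reduces `(frames, classes)` detections to one score per class. In a trained model the gate and the frame weights would be learned rather than fixed.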

CLC Number: TP391

ZHENG Wei-zhe, QIU Peng, WEI Juan. Sound Recognition and Detection Based on Multi-scale Attention Fusion in Weak Label Environment[J]. Computer Science, 2020, 47(5): 120-123. https://doi.org/10.11896/jsjkx.190900111

