Computer Science ›› 2020, Vol. 47 ›› Issue (1): 159-164. doi: 10.11896/jsjkx.190200365

• Computer Graphics & Multimedia •

Environment-assisted Multi-task Learning for Polyphonic Acoustic Event Detection

GAO Li-jian, MAO Qi-rong

  1. (School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China)
  • Received: 2019-02-26 Published: 2020-01-19
  • About author: GAO Li-jian, born in 1993, postgraduate. His main research interests include multimedia intelligent analysis; MAO Qi-rong, born in 1975, professor, Ph.D. supervisor, is a member of the China Computer Federation (CCF). Her main research interests include multimedia intelligent analysis and emotional computing.
  • Supported by:
    This work was supported by the Key Projects of the National Natural Science Foundation of China (1836220) and the National Natural Science Foundation of China (61672267).

Abstract: Polyphonic acoustic event detection (AED) is challenging because the input is a mixture of signals from different events, and features extracted from the mixture as a whole cannot represent each event well. This leads to suboptimal AED performance, especially when the number of sound events increases or the environment changes. Existing methods do not consider the impact of environmental changes on detection performance. Therefore, an Environment-Assisted Multi-Task learning (EAMT) method for AED is proposed. The EAMT model consists of two core parts, an environment classifier and a sound event detector, where the environment classifier is used to learn environment context features. As additional information for event detection, the environment context features are fused with the sound event features to assist sound event detection through multi-task learning, which improves both the robustness of the EAMT model to environmental changes and the performance of polyphonic event detection. Based on the Freesound dataset, one of the mainstream open datasets in the field of AED, and the widely used F1 score as the evaluation metric, three sets of comparative experiments were set up to compare the proposed method with a DNN (baseline) and a CRNN, one of the most popular methods. The experimental results show that, compared with the single-task model, the EAMT model improves the performance of both environment classification and event detection, and that introducing environment context features further improves acoustic event detection. The EAMT model is more robust than the DNN and the CRNN: its F1 score is 2% to 5% higher than that of the other models when the environment changes. When the number of target events increases, the EAMT model still performs well, achieving an improvement of about 2% to 10% in F1 score over the other models.
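
The two-branch idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify layer types or sizes, the fusion operator, or the loss weighting, so the single hidden layer per branch, fusion by concatenation, the `env_weight` parameter, and all function names below are assumptions made purely for illustration. The sketch computes a forward pass for one feature frame and a joint loss that sums a multi-label event loss and a weighted environment classification loss.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer; weights holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def init_params(rng, n_in, n_hidden, n_env, n_evt):
    """Hypothetical parameter layout: one hidden layer per branch."""
    return {
        "env_hidden": init_layer(rng, n_in, n_hidden),
        "env_out": init_layer(rng, n_hidden, n_env),
        "evt_hidden": init_layer(rng, n_in, n_hidden),
        # The event output layer sees event features plus environment features.
        "evt_out": init_layer(rng, 2 * n_hidden, n_evt),
    }

def forward(params, frame):
    # Environment branch: context features, then one environment class (softmax).
    env_feat = relu(dense(frame, *params["env_hidden"]))
    env_prob = softmax(dense(env_feat, *params["env_out"]))
    # Event branch: sound event features.
    evt_feat = relu(dense(frame, *params["evt_hidden"]))
    # Fusion by concatenation (an assumption; the abstract only says "fused").
    fused = evt_feat + env_feat
    # Polyphonic detection is multi-label, so each event gets its own sigmoid.
    evt_prob = sigmoid(dense(fused, *params["evt_out"]))
    return env_prob, evt_prob

def joint_loss(env_prob, env_label, evt_prob, evt_labels, env_weight=0.5):
    """Mean binary cross-entropy over events plus weighted environment cross-entropy."""
    eps = 1e-12
    env_ce = -math.log(env_prob[env_label] + eps)
    evt_bce = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(evt_prob, evt_labels)
    ) / len(evt_labels)
    return evt_bce + env_weight * env_ce

rng = random.Random(0)
params = init_params(rng, n_in=40, n_hidden=16, n_env=3, n_evt=5)
frame = [rng.uniform(-1.0, 1.0) for _ in range(40)]  # stand-in for one feature frame
env_prob, evt_prob = forward(params, frame)
# Two events active at once (polyphonic), environment class 1.
loss = joint_loss(env_prob, 1, evt_prob, [1, 0, 1, 0, 0])
```

Because the event output depends on the environment branch through the fused vector, minimising the joint loss would push the environment branch to learn context that is useful for both tasks, which is the mechanism the abstract attributes to multi-task learning.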

Key words: Acoustic event detection, Environmental robustness, Environment-assisted, Features fusion, Multi-task learning

CLC Number: 

  • TP391