Computer Science ›› 2017, Vol. 44 ›› Issue (10): 318-322. doi: 10.11896/j.issn.1002-137X.2017.10.057

• Graphics, Image and Pattern Recognition •

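Research on Low-resource Mongolian Speech Recognition

ZHANG Ai-ying, NI Chong-jia

  1. Institute of Systems Science and Information Processing, Shandong University of Finance and Economics, Jinan 250014
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61305027), the Natural Science Foundation of Shandong Province (ZR2011FQ024), and the Science and Technology Program of Shandong Higher Education Institutions (J17KB160).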

Abstract: As speech recognition technology has advanced, speech recognition for low-resource languages has attracted growing attention. Taking Mongolian as the target language, we study how to use multilingual information to improve recognition performance under low-resource conditions, for example when only 10 hours of transcribed speech are available for acoustic modeling. A more discriminative acoustic model is obtained through cross-lingual transfer learning with a multilingual deep neural network and through features extracted by a multilingual deep bottleneck neural network. A large amount of web page data is collected with a search engine and a targeted web crawler, which yields text data for strengthening the language model. The outputs of several different recognizers are then fused to further improve recognition accuracy. Compared with the baseline system, the fused system reduces the word error rate (WER) by nearly 12% absolute.

Key words: Low-resource, Multilingual deep neural network, Web-based language model
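
The abstract reports its gain as an absolute WER reduction, which is easy to confuse with a relative one. The following is a minimal sketch of how WER and the two kinds of reduction are commonly computed; the function names and the numbers in the usage note are illustrative assumptions and are not taken from the paper.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for whitespace-tokenized strings."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def absolute_reduction(baseline_wer, system_wer):
    """Difference in WER, in percentage points."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer, system_wer):
    """Difference in WER as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer
```

With a hypothetical baseline WER of 60% and a fused-system WER of 48%, `absolute_reduction(60.0, 48.0)` gives 12 percentage points, while `relative_reduction(60.0, 48.0)` gives 20%.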

