Computer Science ›› 2022, Vol. 49 ›› Issue (1): 53-58. doi: 10.11896/jsjkx.210800269
杨润延1,2, 程高峰1, 刘建1
YANG Run-yan1,2, CHENG Gao-feng1, LIU Jian1
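Abstract: Over the past decade, end-to-end speech recognition frameworks have developed rapidly. Unlike the conventional framework based on hidden Markov models (HMMs), end-to-end speech recognition offers many new features and can achieve equal or better performance. It has therefore attracted growing attention and has become the second mainstream framework alongside conventional speech recognition. To address the problem that end-to-end speech recognition cannot provide the accurate keyword start and end times and the reliable confidence scores that keyword search requires, this paper proposes a keyword search framework based on end-to-end speech recognition and frame-level alignment, and validates it experimentally on a Vietnamese dataset. First, an end-to-end speech recognition model decodes the test utterance to produce N-best hypotheses. Then, frame-wise phoneme probabilities are obtained from a phoneme classifier trained jointly with the recognition model, and a dynamic-programming alignment algorithm aligns the N-best hypotheses with these frame-wise phoneme probabilities, yielding the start and end times and the confidence of each word in the N-best hypotheses. Finally, keywords are matched in the N-best hypotheses, and duplicate matches are merged using the timestamps and confidences to give the final search result. Experiments on a Vietnamese conversational speech dataset show that the proposed keyword search system reaches an F1 score of 77.6%, an improvement of 7.8% over the F1 score of a conventional HMM-based keyword search system, and that it provides reliable keyword confidence scores.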
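The abstract does not spell out the alignment or merging procedures, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a standard Viterbi forced alignment over frame-wise phoneme log-probabilities, a word confidence defined as the geometric mean of the aligned frame probabilities, and a merge rule that keeps the highest-confidence hit among overlapping hits of the same keyword. The function names, the 10 ms frame shift, and the toy phone ids are all hypothetical.

```python
import math

def forced_align(log_probs, phones):
    """Viterbi forced alignment (generic sketch, not the paper's exact algorithm).

    log_probs[t][p] is the log-probability of phone id p at frame t.
    Returns one (start_frame, end_frame) pair, inclusive, per phone.
    """
    T, N = len(log_probs), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phone")
    neg = -math.inf
    score = [[neg] * N for _ in range(T)]
    stay = [[True] * N for _ in range(T)]  # True: previous frame had the same phone
    score[0][0] = log_probs[0][phones[0]]
    for t in range(1, T):
        for j in range(min(t + 1, N)):
            s_stay = score[t - 1][j]
            s_move = score[t - 1][j - 1] if j > 0 else neg
            if s_move > s_stay:
                best, stay[t][j] = s_move, False
            else:
                best = s_stay
            score[t][j] = best + log_probs[t][phones[j]]
    # Backtrace from the last phone at the last frame.
    segs, j, end = [], N - 1, T - 1
    for t in range(T - 1, 0, -1):
        if not stay[t][j]:
            segs.append((t, end))
            end, j = t - 1, j - 1
    segs.append((0, end))
    return segs[::-1]

def align_words(log_probs, words, frame_shift=0.01):
    """words: list of (word, [phone ids]). frame_shift in seconds is an assumption.

    Returns (word, start_time, end_time, confidence) per word, where confidence
    is the geometric mean of the aligned frame probabilities (assumed definition).
    """
    phones = [p for _, ps in words for p in ps]
    segs = forced_align(log_probs, phones)
    out, k = [], 0
    for word, ps in words:
        first, last = segs[k][0], segs[k + len(ps) - 1][1]
        total = 0.0
        for p, (s, e) in zip(ps, segs[k:k + len(ps)]):
            total += sum(log_probs[t][p] for t in range(s, e + 1))
        conf = math.exp(total / (last - first + 1))
        out.append((word, first * frame_shift, (last + 1) * frame_shift, conf))
        k += len(ps)
    return out

def merge_hits(hits):
    """Among time-overlapping hits of the same keyword, keep the most confident one."""
    kept = []
    for kw, s, e, c in sorted(hits, key=lambda h: -h[3]):
        if not any(kw == k and s < e2 and s2 < e for k, s2, e2, _ in kept):
            kept.append((kw, s, e, c))
    return sorted(kept, key=lambda h: h[1])

# Toy example: 4 frames, 2 phone classes, two one-phone words.
frames = [[math.log(a), math.log(b)]
          for a, b in [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.1, 0.9)]]
result = align_words(frames, [("xin", [0]), ("chao", [1])])
```

In this toy case the first word is aligned to frames 0 and 1 and the second to frames 2 and 3. In a system like the one described, `align_words` would be run once per N-best hypothesis and the matched keyword hits from all hypotheses would then be passed through a merge step such as `merge_hits`.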