Computer Science ›› 2022, Vol. 49 ›› Issue (1): 53-58. doi: 10.11896/jsjkx.210800269

• Frontier Technologies of Multilingual Computing*


Study on Keyword Search Framework Based on End-to-End Automatic Speech Recognition

YANG Run-yan1,2, CHENG Gao-feng1, LIU Jian1   

  1 Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
    2 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2021-08-31  Revised: 2021-10-15  Online: 2022-01-15  Published: 2022-01-18
  • About author: YANG Run-yan (yangrunyan@hccl.ioa.ac.cn), born in 1996, postgraduate. His main research interests include multilingual automatic speech recognition and keyword search.
    LIU Jian, born in 1971, Ph.D, professor, master's supervisor. His main research interests include continuous automatic speech recognition, embedded speech recognition, and music retrieval.
  • Corresponding author: LIU Jian (liujian@hccl.ioa.ac.cn)
  • Supported by:
    National Key Research and Development Program (2020AAA0108002).


Abstract: In the past decade, end-to-end automatic speech recognition (ASR) frameworks have developed rapidly. End-to-end ASR shows not only very different characteristics from traditional ASR based on hidden Markov models (HMMs), but also comparable or even better performance. Thus, end-to-end ASR has attracted more and more attention and has become another major type of ASR framework alongside the traditional one. To address the problem that end-to-end ASR cannot provide the accurate keyword time boundaries and reliable confidence scores required by keyword search, a keyword search (KWS) framework based on end-to-end ASR and frame-synchronous alignment is proposed and verified experimentally on a Vietnamese dataset. First, utterances are decoded by an end-to-end ASR model to obtain N-best hypotheses. Next, a dynamic-programming-based alignment algorithm aligns each of these hypotheses with per-frame phoneme probabilities, which are produced by a phoneme classifier jointly trained with the ASR model, to compute time stamps and confidence scores for every word in the N-best hypotheses. Finally, the KWS result is obtained by matching keywords within the N-best hypotheses and merging duplicated keyword occurrences according to their time stamps and confidence scores. Experimental results on a Vietnamese conversational telephone speech dataset show that the proposed KWS system achieves an F1 score of 77.6%, a relative improvement of 7.8% over the traditional HMM-based KWS system. The proposed system also provides reliable keyword confidence scores.
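To make the frame-synchronous alignment step concrete, the sketch below is a minimal Python/NumPy illustration of a monotonic, Viterbi-style dynamic-programming alignment between one hypothesis phoneme sequence and a matrix of per-frame phoneme posteriors, from which per-phoneme time spans and averaged posterior confidences are read off. The function name align_phonemes, the left-to-right stay/advance topology, and the use of the mean posterior as a confidence score are illustrative assumptions, not the authors' exact formulation; word-level time stamps and confidences would then follow by grouping consecutive phoneme spans according to the lexicon.

import numpy as np

def align_phonemes(log_post, phonemes):
    """Monotonic DP (Viterbi-style) alignment of a phoneme sequence to
    per-frame log-posteriors (log_post shape: [T, num_phonemes]).
    Assumes T >= len(phonemes). Returns one (start_frame, end_frame,
    mean_posterior) tuple per phoneme.  Illustrative sketch only."""
    T = log_post.shape[0]
    N = len(phonemes)
    dp = np.full((T, N), -np.inf)        # dp[t, n]: best score with phoneme n emitted at frame t
    back = np.zeros((T, N), dtype=int)   # 0 = stay on phoneme n, 1 = advance from phoneme n-1
    dp[0, 0] = log_post[0, phonemes[0]]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]
            adv = dp[t - 1, n - 1] if n > 0 else -np.inf
            if adv > stay:
                dp[t, n], back[t, n] = adv + log_post[t, phonemes[n]], 1
            else:
                dp[t, n], back[t, n] = stay + log_post[t, phonemes[n]], 0
    # Backtrace to recover the frame span of every phoneme.
    spans, t, n, end = [], T - 1, N - 1, T - 1
    while t > 0:
        if back[t, n] == 1:              # phoneme n started at frame t
            spans.append((n, t, end))
            end, n = t - 1, n - 1
        t -= 1
    spans.append((0, 0, end))            # first phoneme covers frames 0..end
    spans.reverse()
    return [(s, e, float(np.exp(log_post[s:e + 1, phonemes[i]]).mean()))
            for i, s, e in spans]

# Toy usage: 6 frames, 3 phoneme classes, hypothesis phoneme ids [0, 1, 0].
log_post = np.log(np.random.dirichlet(np.ones(3), size=6))
print(align_phonemes(log_post, [0, 1, 0]))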

Key words: End-to-end, Frame-synchronous alignment, Keyword search, Speech recognition
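For the reported metric: with precision P and recall R computed over detected keyword occurrences, F1 = 2PR/(P+R), so the stated 7.8% relative improvement over the HMM-based baseline corresponds to a baseline F1 of roughly 77.6%/1.078 ≈ 72.0%. The snippet below is a hypothetical sketch of the final KWS step, merging duplicate keyword detections that come from different N-best hypotheses by time overlap and then scoring the result; the 50% overlap threshold and the keep-the-highest-confidence rule are assumptions for illustration, not the paper's exact procedure.

from dataclasses import dataclass

@dataclass
class Hit:
    keyword: str
    start: float   # seconds
    end: float
    score: float   # confidence in [0, 1]

def merge_hits(hits, min_overlap=0.5):
    """Merge keyword hits found in different N-best hypotheses: two hits of
    the same keyword whose spans overlap by more than min_overlap (fraction
    of the shorter span) count as one occurrence; keep the higher-confidence one."""
    merged = []
    for h in sorted(hits, key=lambda x: -x.score):
        duplicate = False
        for m in merged:
            if h.keyword != m.keyword:
                continue
            inter = min(h.end, m.end) - max(h.start, m.start)
            shorter = min(h.end - h.start, m.end - m.start)
            if shorter > 0 and inter / shorter > min_overlap:
                duplicate = True
                break
        if not duplicate:
            merged.append(h)
    return merged

def f1_score(num_correct, num_detected, num_reference):
    precision = num_correct / max(num_detected, 1)
    recall = num_correct / max(num_reference, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

# Toy usage: two overlapping detections of the same keyword from different hypotheses.
hits = [Hit("hanoi", 1.20, 1.55, 0.91), Hit("hanoi", 1.22, 1.60, 0.47)]
print(merge_hits(hits))   # keeps only the 0.91-confidence occurrence
print(f1_score(num_correct=1, num_detected=1, num_reference=1))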

CLC Number: 

  • TP391