Computer Science ›› 2022, Vol. 49 ›› Issue (1): 53-58.doi: 10.11896/jsjkx.210800269

• Multilingual Computing Advanced Technology •

Study on Keyword Search Framework Based on End-to-End Automatic Speech Recognition

YANG Run-yan1,2, CHENG Gao-feng1, LIU Jian1   

1 Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
    2 University of Chinese Academy of Sciences, Beijing 100049, China
• Received: 2021-08-31  Revised: 2021-10-15  Online: 2022-01-15  Published: 2022-01-18
• About author: YANG Run-yan, born in 1996, postgraduate. His main research interests include multilingual automatic speech recognition and keyword search.
    LIU Jian, born in 1971, Ph.D., professor, master's supervisor. His main research interests include continuous automatic speech recognition, embedded speech recognition, and music retrieval.
• Supported by:
    National Key Research and Development Program (2020AAA0108002).

Abstract: Over the past decade, end-to-end automatic speech recognition (ASR) frameworks have developed rapidly. End-to-end ASR not only exhibits characteristics very different from those of traditional ASR based on hidden Markov models (HMMs), but also delivers strong performance; as a result, it has become increasingly popular and is now another major class of ASR framework. To address the problem that end-to-end ASR cannot provide accurate keyword timestamps and confidence scores, a keyword search (KWS) framework based on end-to-end ASR and frame-synchronous alignment is proposed and verified experimentally on a Vietnamese dataset. First, utterances are decoded by the end-to-end ASR system to obtain N-best hypotheses. Next, a dynamic programming-based alignment algorithm is applied to each ASR hypothesis together with the per-frame phoneme probabilities produced by a phoneme classifier jointly trained with the ASR model, yielding timestamps and confidence scores for every word in the N-best hypotheses. Finally, the KWS result is obtained by detecting keywords within the N-best hypotheses and removing duplicated keyword occurrences according to their timestamps and confidence scores. Experiments on a Vietnamese conversational telephone speech dataset show that the proposed KWS system achieves an F1 score of 77.6%, a relative improvement of 7.8% over the traditional HMM-based KWS system, and that it provides reliable keyword confidence scores.
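
The frame-synchronous alignment step described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function name align_phonemes, the exact dynamic-programming recursion (a simple monotonic segmentation maximizing summed log-probability), and the confidence definition (mean per-frame log-probability over a phoneme's aligned span) are illustrative assumptions. It only shows how per-frame phoneme probabilities from a jointly trained classifier can yield timestamps and confidence scores for one hypothesis; word-level scores and the duplicate-removal step over N-best hypotheses would be built on top of such segments.

```python
import numpy as np

def align_phonemes(posteriors, phoneme_ids):
    """Monotonic DP alignment of a phoneme sequence against per-frame posteriors.

    posteriors  : (T, P) array of per-frame phoneme probabilities.
    phoneme_ids : length-N list of phoneme indices from one hypothesis (N <= T).
    Returns a list of (phoneme_id, start_frame, end_frame, confidence) tuples.
    """
    T = posteriors.shape[0]
    N = len(phoneme_ids)
    log_post = np.log(posteriors + 1e-10)

    # dp[n, t]: best path score with phoneme n emitting frame t; back[n, t] remembers
    # whether frame t continued phoneme n (0) or entered it from phoneme n-1 (1).
    dp = np.full((N, T), -np.inf)
    back = np.zeros((N, T), dtype=int)
    dp[0, 0] = log_post[0, phoneme_ids[0]]
    for t in range(1, T):
        for n in range(min(N, t + 1)):
            stay = dp[n, t - 1]
            move = dp[n - 1, t - 1] if n > 0 else -np.inf
            dp[n, t] = max(stay, move) + log_post[t, phoneme_ids[n]]
            back[n, t] = 0 if stay >= move else 1

    # Trace back to recover each phoneme's frame span (start and end timestamps).
    segments, n, t, end = [], N - 1, T - 1, T - 1
    while t > 0:
        if back[n, t] == 1:
            segments.append((n, t, end))
            end, n = t - 1, n - 1
        t -= 1
    segments.append((0, 0, end))
    segments.reverse()

    # Confidence of a phoneme: mean per-frame log-probability over its span
    # (an assumption; word-level scores would aggregate these further).
    return [(phoneme_ids[i], s, e, float(log_post[s:e + 1, phoneme_ids[i]].mean()))
            for i, s, e in segments]

# Toy usage: 10 frames, 3 phoneme classes, one hypothesis made of phonemes 0, 1, 2.
np.random.seed(0)
posteriors = np.random.dirichlet(np.ones(3), size=10)
for pid, start, end, conf in align_phonemes(posteriors, [0, 1, 2]):
    print(f"phoneme {pid}: frames {start}-{end}, confidence {conf:.3f}")
```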

Key words: End-to-end, Frame-synchronous alignment, Keyword search, Speech recognition

CLC Number: TP391