Computer Science ›› 2022, Vol. 49 ›› Issue (1): 59-64. doi: 10.11896/jsjkx.210900007

• Multilingual Computing Advanced Technology •

Query-by-Example with Acoustic Word Embeddings Using wav2vec Pretraining

LI Zhao-qi, LI Ta   

  1. Key Laboratory of Speech Acoustics and Content Understanding,Institute of Acoustics,Chinese Academy of Sciences,Beijing 100190,China
  2. University of Chinese Academy of Sciences,Beijing 100049,China
  • Received:2021-09-01 Revised:2021-10-09 Online:2022-01-15 Published:2022-01-18
  • About author:LI Zhao-qi,born in 1995,Ph.D.His main research interests include query by example spoken term detection and speech recognition.
    LI Ta,born in 1982,Ph.D,professor.His main research interests include large vocabulary continuous speech recognition,keyword search,speaker recognition,pronunciation evaluation,emotion recognition,speech classification and analysis,and human-computer speech interaction technology.
  • Supported by:
    National Key R & D Program of China(2020AAA0108002).

Abstract: Query-by-example is a popular keyword detection approach for low-resource settings: it can build a well-performing keyword query system when labeled speech is scarce and no pronunciation dictionary is available. In recent years, neural acoustic word embeddings have become a common query-by-example method. In this paper, we propose to use wav2vec pre-training to improve a neural acoustic word embedding system built on bidirectional long short-term memory networks. On a dataset extracted from Switchboard, directly replacing Mel-frequency cepstral coefficient (MFCC) features with features extracted by the wav2vec model improves the system's average precision by a relative 11.1% and its precision-recall break-even point by a relative 10.0%. We then explore several methods of fusing wav2vec and MFCC features when extracting the embedding vectors; the fusion method further improves average precision and the precision-recall break-even point by a relative 5.3% and 2.5% over the wav2vec-only system.
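To make the pipeline concrete, the sketch below shows one way a bidirectional-LSTM acoustic word embedder of this kind could look. This is not the authors' implementation (the references cite Kaldi and TensorFlow; this sketch uses PyTorch): the class name BLSTMEmbedder, all layer sizes, the feature dimensions, and the triplet margin are illustrative assumptions, the fusion shown is plain frame-level concatenation of the wav2vec and MFCC streams (which presumes matching frame rates), and the triplet objective follows ref.[23].

```python
# Hypothetical sketch of a BLSTM acoustic word embedder with optional
# frame-level wav2vec/MFCC fusion and a triplet objective; all sizes
# (39-dim MFCCs, 512-dim wav2vec features, margin 0.4) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLSTMEmbedder(nn.Module):
    def __init__(self, w2v_dim=512, mfcc_dim=39, hidden=256,
                 embed_dim=512, fuse=True):
        super().__init__()
        self.fuse = fuse
        in_dim = w2v_dim + mfcc_dim if fuse else w2v_dim
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, w2v, mfcc=None):
        # Fusion here is simple concatenation of the two frame-level
        # feature streams along the feature axis (frame rates must match).
        x = torch.cat([w2v, mfcc], dim=-1) if self.fuse else w2v
        _, (h_n, _) = self.blstm(x)
        # Concatenate the last layer's forward and backward final states
        # to get one fixed-length vector per spoken word segment.
        emb = self.proj(torch.cat([h_n[-2], h_n[-1]], dim=-1))
        return F.normalize(emb, dim=-1)  # unit norm, ready for cosine scoring

model = BLSTMEmbedder()
triplet = nn.TripletMarginLoss(margin=0.4)
# Toy batch: 8 segments of 100 frames each; anchor and positive are the
# same word type, negative is a different word.
feats = lambda: (torch.randn(8, 100, 512), torch.randn(8, 100, 39))
anchor, positive, negative = (model(*feats()) for _ in range(3))
loss = triplet(anchor, positive, negative)
```

At query time, the query example and every test segment would be mapped through the same network, and detections ranked by the cosine similarity of the resulting embeddings.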

Key words: Acoustic word embedding, Isolated word discrimination, Query-by-example, Spoken term detection, wav2vec pretraining

CLC Number: TP181
[1]ITAKURA F.Minimum prediction residual principle applied to speech recognition[J].IEEE Transactions on Acoustics,Speech,and Signal Processing,1975,23(1):67-72.
[2]SETTLE S,LEVIN K,KAMPER H,et al.Query-by-example search with discriminative neural acoustic word embeddings[C]//Proc.Interspeech.Stockholm,Sweden,2017:2874-2878.
[3]SHAH N,SREERAJ R,MADHAVI M C,et al.Query-By-Example Spoken Term Detection Using Generative Adversarial Network[C]//Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).IEEE,2020:644-648.
[4]HAZEN T J,SHEN W,WHITE C.Query-by-example spoken term detection using phonetic posteriorgram templates[C] // IEEE Workshop on Automatic Speech Recognition & Understanding.Merano,Italy,2009:421-426.
[5]ZHANG Y D,GLASS J R.Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams[C]//IEEE Workshop on Automatic Speech Recognition & Understanding.Merano,Italy,2009:398-403.
[6]MA M,WU H,WANG X,et al.Acoustic word embedding system for code-switching query-by-example spoken term detection[C]//12th International Symposium on Chinese Spoken Language Processing (ISCSLP).IEEE,2021.
[7]CHEN H J,LEUNG C C,XIE L,et al.Unsupervised bottleneck features for low-resource query-by-example spoken term detection[C]//Proc.Interspeech.San Francisco,USA,2016:923-927.
[8]YUAN Y G,LEUNG C C,XIE L,et al.Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).New Orleans,USA,2017:5645-5649.
[9]RAM D,MICULICICH L,BOURLARD H.Multilingual bottleneck features for query-by-example spoken term detection[C]//IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).Sentosa,Singapore,2019:621-628.
[10]RAM D,MICULICICH L,BOURLARD H.Neural network based end-to-end query by example spoken term detection[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2020,28:1416-1427.
[11]LEVIN K,JANSEN A,VAN DURME B.Segmental acoustic indexing for zero resource keyword search[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).Brisbane,Australia,2015:5828-5832.
[12]CHUNG Y A,WU C C,SHEN C H,et al.Audio word2vec:Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder[C]//Proc.Interspeech.San Francisco,USA,2016:765-769.
[13]MÜLLER M.Dynamic time warping[M]//Information Retrieval for Music and Motion.Berlin:Springer,2007:69-84.
[14]RAM D,ASAEI A,BOURLARD H.Sparse subspace modeling for query by example spoken term detection[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2018,26(6):1130-1143.
[15]ZHAN J,HE Q,SU J,et al.A Stage Match for Query-by-Example Spoken Term Detection Based on Structure Information of Query[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP 2021).IEEE,2021:6833-6837.
[16]HE W J,WANG W R,LIVESCU K.Multi-view recurrent neural acoustic word embeddings[C]//Proc.ICLR. Toulon,France,2017.
[17]JUNG M,LIM H,GOO J,et al.Additional shared decoder on siamese multi-view encoders for learning acoustic word embeddings[C]//IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).Sentosa,Singapore,2019:629-636.
[18]AUDHKHASI K,ROSENBERG A,SETHY A,et al.End-to-end ASR-free keyword search from speech[J].IEEE Journal of Selected Topics in Signal Processing,2017,11(8):1351-1359.
[19]KAMPER H,LIVESCU K,GOLDWATER S.An embedded segmental k-means model for unsupervised segmentation and clustering of speech[C]//IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).Okinawa,Japan,2017:719-726.
[20]SCHNEIDER S,BAEVSKI A,COLLOBERT R,et al. wav2vec:Unsupervised pre-training for speech recognition[C]//Proc.Interspeech.Graz,Austria,2019:3465-3469.
[21]BAEVSKI A,AULI M,MOHAMED A.Effectiveness of self-supervised pre-training for ASR[C]//International Conference on Acoustics,Speech and Signal Processing (ICASSP).Barcelona,Spain,2020:7694-7698.
[22]RIVIÈRE M,JOULIN A,MAZARÈ P E,et al.Unsupervised pretraining transfers well across languages[C]//International Conference on Acoustics,Speech and Signal Processing (ICASSP).Virtual Barcelona,Spain,2020:7414-7418.
[23]HOFFER E,AILON N.Deep metric learning using triplet network[C]//International Workshop on Similarity-based Pattern Recognition.Cham:Springer,2015:84-92.
[24]GODFREY J J,HOLLIMAN E C,MCDANIEL J.SWITCHBOARD:telephone speech corpus for research and development[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).San Francisco,USA,1992:517-520.
[25]POVEY D,GHOSHAL A,BOULIANNE G,et al.The Kaldi Speech Recognition Toolkit[C]//IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).Big Island,USA,2011:1-14.
[26]ABADI M,AGARWAL A,BARHAM P,et al.Tensorflow:Large-scale machine learning on heterogeneous distributed systems [EB/OL].(2016-3-16) [2021-08-31].https://arxiv.org/abs/1603.04467.
[27]PANAYOTOV V,CHEN G,POVEY D,et al.Librispeech:an ASR corpus based on public domain audio books[C]//IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).Brisbane,Australia,2015:5206-5210.
[28]SETTLE S,LIVESCU K.Discriminative acoustic word embeddings:Recurrent neural network-based approaches[C]//2016 IEEE Spoken Language Technology Workshop (SLT).IEEE,2016:503-510.