Query-by-Example with Acoustic Word Embeddings Using wav2vec Pretraining

LI Zhao-qi, LI Ta   

  1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
    University of Chinese Academy of Sciences, Beijing 100049, China
  Received: 2021-09-01 Revised: 2021-10-09 Online: 2022-01-15 Published: 2022-01-18
  • About author: LI Zhao-qi, born in 1995, Ph.D. His main research interests include query-by-example spoken term detection and speech recognition.
    LI Ta, born in 1982, Ph.D., professor. His main research interests include large vocabulary continuous speech recognition, keyword search, speaker recognition, pronunciation evaluation, emotion recognition, speech classification and analysis, and human-computer speech interaction technology.
  Supported by:
    National Key R&D Program of China (2020AAA0108002).

Abstract: Query-by-example is a popular keyword detection method for settings where speech resources are scarce. It can build a keyword query system with excellent performance when few labeled speech resources are available and pronunciation dictionaries are lacking. In recent years, neural acoustic word embeddings have become a commonly used query-by-example method. In this paper, we propose using wav2vec pretraining to improve a neural acoustic word embedding system based on bidirectional long short-term memory. On a data set extracted from Switchboard, directly replacing the Mel-frequency cepstral coefficient (MFCC) features with features extracted by the wav2vec model yields relative improvements of 11.1% in average precision and 10.0% in precision-recall break-even point. We then tried several methods of fusing the wav2vec features with the MFCC features to extract the embedding vector. Compared with the method using only wav2vec features, the fusion method achieves further relative improvements of 5.3% in average precision and 2.5% in precision-recall break-even point.
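
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how average precision and the precision-recall break-even point might be computed for an isolated word discrimination task. It assumes that word segments are compared by the cosine distance between their embeddings and that a pair counts as positive when both segments carry the same word label; the abstract does not state the exact evaluation protocol, and the function names `pairwise_cosine_distances` and `ap_and_break_even` are hypothetical.

```python
import numpy as np


def pairwise_cosine_distances(embeddings, labels):
    """Return cosine distances and same-word flags for all unordered pairs.

    Assumption: cosine distance is the comparison measure; the abstract
    does not specify which distance the authors use.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    i, j = np.triu_indices(len(embeddings), k=1)  # each pair once, no self-pairs
    return 1.0 - similarity[i, j], labels[i] == labels[j]


def ap_and_break_even(distances, same_word):
    """Average precision and precision-recall break-even point.

    distances: 1-D array, smaller means more similar.
    same_word: 1-D boolean array, True where the pair shares a word label.
    """
    order = np.argsort(distances, kind="stable")  # most similar pairs first
    hits = same_word[order].astype(float)
    n_pos = int(hits.sum())
    cum_hits = np.cumsum(hits)
    precision = cum_hits / np.arange(1, len(hits) + 1)
    # Average precision: mean of the precision values at each positive pair.
    ap = float((precision * hits).sum() / n_pos)
    # Precision equals recall exactly at rank n_pos, so the break-even
    # point is the precision measured at that rank.
    break_even = float(cum_hits[n_pos - 1] / n_pos)
    return ap, break_even
```

With this sketch, a "relative increase" such as those quoted in the abstract would correspond to `(new - baseline) / baseline` applied to either metric.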

Key words: Acoustic word embedding, Isolated word discrimination, Query-by-example, Spoken term detection, wav2vec pretraining

