Query-by-Example with Acoustic Word Embeddings Using wav2vec Pretraining

doi:10.11896/jsjkx.210900007

Computer Science, 2022, Vol. 49, Issue (1): 59-64.

• Frontier Technologies in Multilingual Computing •

LI Zhao-qi, LI Ta

Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2021-09-01 Revised: 2021-10-09 Online: 2022-01-15 Published: 2022-01-18
Corresponding author: LI Ta (lita@hccl.ioa.ac.cn)
About the authors: LI Zhao-qi (lizhaoqi@hccl.ioa.ac.cn), born in 1995, Ph.D. His main research interests include query-by-example spoken term detection and speech recognition.
LI Ta, born in 1982, Ph.D., professor. His main research interests include large-vocabulary continuous speech recognition, keyword search, speaker recognition, pronunciation evaluation, emotion recognition, speech classification and analysis, and human-computer speech interaction technology.
Supported by:
National Key R&D Program of China (2020AAA0108002).

Abstract

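Abstract (translated from the Chinese): Query-by-example spoken term detection is the task of matching a spoken keyword segment against segments of a speech stream. In low-resource or zero-resource settings, it usually relies on methods based on dynamic time warping. In recent years, neural acoustic word embeddings have become a common approach to query-by-example spoken term detection, but neural methods are limited by the amount of labeled data. wav2vec pretraining can reduce the network's dependence on data volume and improve system performance. On a dataset extracted from the SwitchBoard corpus, directly replacing the Mel-frequency cepstral coefficient features with pretrained features extracted by a wav2vec model improved the average precision of a bidirectional long short-term memory neural acoustic word embedding system by 11.1% and its precision-recall break-even point by 10.0%. Fusing the wav2vec features with the Mel-frequency cepstral coefficient features to extract the embedding vectors improved performance further: compared with using wav2vec alone, the fusion method raised average precision by 5.3% and the precision-recall break-even point by 2.5%.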

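Keywords (translated from the Chinese): wav2vec pretraining, isolated word recognition, acoustic word embedding, query-by-example, spoken segment query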
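
The abstract names dynamic time warping (DTW) as the usual approach in low-resource and zero-resource settings. As a rough illustration only, the sketch below computes a length-normalised DTW cost between two feature sequences with NumPy. The function name `dtw_distance`, the Euclidean frame distance, and the normalisation by the combined sequence length are assumptions made here for illustration; none of them is taken from the paper.

```python
import numpy as np


def dtw_distance(query: np.ndarray, segment: np.ndarray) -> float:
    """Length-normalised DTW cost between two (frames x dims) feature arrays.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    n, m = len(query), len(segment)
    # Euclidean distance between every query frame and every segment frame.
    local = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=-1)
    # acc[i, j] is the cheapest cost of aligning the first i query frames
    # with the first j segment frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_previous
    # Normalise so that long and short segments are comparable (an assumption).
    return float(acc[n, m] / (n + m))
```

A time-stretched copy of a sequence aligns at zero cost, which is the property that lets DTW compare a spoken query with segments of different durations.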

Abstract: Query-by-example is a popular keyword detection method when speech resources are scarce. It can build a keyword query system with excellent performance when there are few labeled speech resources and a lack of pronunciation dictionaries. In recent years, neural acoustic word embeddings have become a commonly used query-by-example method. In this paper, we propose using wav2vec pretraining to improve a neural acoustic word embedding system built on a bidirectional long short-term memory network. On a dataset extracted from SwitchBoard, directly replacing the Mel-frequency cepstral coefficient features with features extracted by the wav2vec model yields relative increases of 11.1% in the system's average precision and 10.0% in its precision-recall break-even point. We then tried several methods of fusing the wav2vec features with the Mel-frequency cepstral coefficient features to extract the embedding vector. Compared with the method using only wav2vec, the fusion method yields relative increases of 5.3% in average precision and 2.5% in precision-recall break-even point.

Key words: Acoustic word embedding, Isolated word discrimination, Query-by-example, Spoken term detection, wav2vec pretraining
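
The two figures reported in the abstract, average precision and precision-recall break-even point, can both be computed from a ranked list of word-pair distances. The sketch below is a minimal illustration that assumes a same-different word discrimination setup: each pair of embeddings has a distance (for example a cosine distance) and a label saying whether the two segments are the same word. The function name and input layout are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def average_precision_and_break_even(distances, same_word):
    """Return (average precision, precision-recall break-even point).

    distances: one distance per word pair; smaller means more similar.
    same_word: True where the pair holds two instances of the same word.
    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    distances = np.asarray(distances, dtype=float)
    same_word = np.asarray(same_word, dtype=bool)
    n_pos = int(same_word.sum())
    if n_pos == 0:
        raise ValueError("need at least one same-word pair")
    # Rank the pairs from most similar to least similar.
    hits = same_word[np.argsort(distances, kind="stable")]
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, len(hits) + 1)
    # Average precision: mean of the precision values at each same-word pair.
    ap = precision[hits].mean()
    # After retrieving exactly n_pos pairs, precision and recall have the same
    # numerator and denominator, so that rank is the break-even point.
    break_even = true_pos[n_pos - 1] / n_pos
    return float(ap), float(break_even)
```

Reading the break-even point at the rank equal to the number of same-word pairs avoids searching for the crossing of the two curves, since precision and recall coincide there by construction.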

CLC number: TP181

LI Zhao-qi, LI Ta. Query-by-Example with Acoustic Word Embeddings Using wav2vec Pretraining[J]. Computer Science, 2022, 49(1): 59-64. https://doi.org/10.11896/jsjkx.210900007
