Children’s Reading Speech Evaluation Model Based on Deep Speech and Multi-layer LSTM

Abstract: Reading aloud is often neglected nowadays, yet for children aged 5 to 12 it is both an essential skill in the learning process and an effective means of nurturing temperament and character. The features of read-aloud speech signals are related nonlinearly to the evaluation criteria, and although recurrent neural networks are suited to time-series prediction, their predictive performance is limited over long time spans. Based on the characteristics of children’s read-aloud speech and its evaluation system, a model combining Deep Speech with a three-layer LSTM (Long Short-Term Memory) neural network was designed. First, building on an attention mechanism, accuracy and fluency measures for read-aloud speech evaluation are proposed, with the spectrogram serving as the input to feature extraction. Accuracy is evaluated with an improved Deep Speech that raises phoneme recognition accuracy, while for fluency the spectrogram is fed into the three-layer LSTM model to capture time-series effects. Then, the two results are passed to the attention mechanism for weight adjustment, and finally the computed overall evaluation result is used to score the children’s read-aloud speech. Experiments use the children’s reading corpus provided by the “Chukou Chengzhang” (出口成章) software and are run on the TensorFlow platform. The results show that, compared with traditional models, the proposed model judges both the correctness and the fluency of reading aloud accurately, and the scores produced by its evaluation model are more accurate.
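The fluency branch described in the abstract (spectrogram frames fed through a three-layer LSTM) can be sketched in plain NumPy as follows. This is not the authors’ TensorFlow implementation: the abstract does not state the frame length, hop size, hidden size, or how the LSTM output is turned into a fluency value, so the sketch assumes 16 kHz audio, a 512-point Hann-windowed STFT with a 160-sample hop, 32 hidden units per layer, untrained random weights, and a sigmoid read-out on the last hidden state. All function names are hypothetical.

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=160):
    """Log-magnitude spectrogram from a Hann-windowed STFT; shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[k * hop:k * hop + n_fft] * window for k in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(inputs, W, U, b):
    """One LSTM layer over a (T, d_in) sequence; returns all hidden states, shape (T, h).

    W: (d_in, 4h), U: (h, 4h), b: (4h,); gates ordered input, forget, candidate, output.
    """
    h_size = U.shape[0]
    h, c = np.zeros(h_size), np.zeros(h_size)
    outputs = []
    for x_t in inputs:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:h_size])
        f = sigmoid(z[h_size:2 * h_size])
        g = np.tanh(z[2 * h_size:3 * h_size])
        o = sigmoid(z[3 * h_size:])
        c = f * c + i * g          # update cell state
        h = o * np.tanh(c)         # new hidden state
        outputs.append(h)
    return np.stack(outputs)

def init_params(d_in, hidden, n_layers, rng):
    """Random (untrained) weights for a stack of LSTM layers."""
    params = []
    for layer in range(n_layers):
        d = d_in if layer == 0 else hidden
        params.append((rng.normal(0, 0.1, (d, 4 * hidden)),
                       rng.normal(0, 0.1, (hidden, 4 * hidden)),
                       np.zeros(4 * hidden)))
    return params

def fluency_score(spec, params, w_out, b_out):
    """Feed spectrogram frames through the stacked layers; map the last hidden state to (0, 1)."""
    x = spec
    for W, U, b in params:
        x = lstm_layer(x, W, U, b)   # each layer consumes the previous layer's hidden sequence
    return float(sigmoid(x[-1] @ w_out + b_out))

rng = np.random.default_rng(0)
signal = rng.normal(size=16000)               # placeholder: 1 s of noise at an assumed 16 kHz
spec = log_spectrogram(signal)                # shape (97, 257)
params = init_params(spec.shape[1], hidden=32, n_layers=3, rng=rng)
score = fluency_score(spec, params, rng.normal(0, 0.1, 32), 0.0)
print(spec.shape, score)
```

Stacking works by passing each layer’s full sequence of hidden states to the next layer, so only the top layer’s final state reaches the read-out. With random weights the score is meaningless; the weights would have to be trained on labeled reading data, which this sketch omits.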

Keywords: Attention mechanism, DeepSpeech, Read-aloud speech evaluation model, Long Short-Term Memory, Spectrogram
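The accuracy measure and the attention-based weighting summarized in the abstract can likewise be illustrated with a small sketch. The abstract gives neither formula, so the following assumes that accuracy is 1 minus the phoneme error rate between the reference text’s phonemes and the phonemes recognized by the improved Deep Speech model (whose output is simply taken as given here), and that the attention step is a softmax over one logit per branch, computed from a per-branch feature vector and a learned weight vector. The phoneme labels, the fluency value, and the attention parameters are placeholders, and all function names are hypothetical.

```python
import numpy as np

def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion of r
                            curr[j - 1] + 1,           # insertion of h
                            prev[j - 1] + (r != h)))   # substitution (0 if match)
        prev = curr
    return prev[-1]

def accuracy_score(ref, hyp):
    """Assumed accuracy measure: 1 - phoneme error rate, clipped at 0."""
    if not ref:
        return 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

def attention_fuse(sub_scores, features, w):
    """Softmax attention: one logit per branch from features @ w, then a weighted sum of sub-scores."""
    logits = features @ w
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(weights @ sub_scores), weights

# Hypothetical example: reference vs. recognized pinyin units for "ni hao" (one tone error).
reference = ["n", "i3", "h", "ao3"]
recognized = ["n", "i3", "h", "ao2"]
accuracy = accuracy_score(reference, recognized)   # 1 - 1/4 = 0.75
fluency = 0.9                                       # placeholder output of the fluency branch
features = np.eye(2)                                # placeholder per-branch feature vectors
w = np.array([np.log(3.0), 0.0])                    # placeholder attention parameters
total, weights = attention_fuse(np.array([accuracy, fluency]), features, w)
print(weights, total)  # weights ≈ [0.75, 0.25], total ≈ 0.7875
```

Because the softmax weights sum to 1, the fused score stays on the same 0 to 1 scale as the two sub-scores; how the paper maps the overall result onto its final grading scale is not stated in the abstract and is left open here.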

CLC number: TP183

ZHENG Chun-jun, JIA Ning. Children’s Reading Speech Evaluation Model Based on Deep Speech and Multi-layer LSTM[J]. Computer Science, 2019, 46(11A): 108-111. https://doi.org/

