Study on Quality Evaluation Method of Speech Datasets for Algorithm Model

doi:10.11896/jsjkx.210800246

Abstract

Abstract: As intelligent speech technologies and products mature and are deployed at scale, the demand for high-quality speech datasets keeps growing. Quality evaluation methods for structured data have received some study, but no evaluation standard has yet been established for unstructured speech datasets. By studying how speech algorithm models are built and analyzing the requirements for constructing speech datasets, this paper establishes a unified quality evaluation framework for speech datasets. The framework evaluates speech datasets intended for algorithm model training along four dimensions: breadth coverage, selection distinctiveness, domain depth, and data completeness. Concrete evaluation indicators, calculation methods, and evaluation steps are proposed and applied to speech datasets in the in-vehicle application domain, and the results are analyzed to provide a reference for evaluating speech dataset quality and promoting dataset construction. Finally, taking into account diversified applicability, privacy issues, efficiency requirements, and automation needs in dataset construction, suggestions are offered for the future development of high-quality speech datasets.

Key words: Artificial intelligence, Speech dataset, Quality assessment, Algorithm, Model, Intelligent speech

CLC Number:

TN912.34

LI Sun, CAO Feng, LIU Zi-shan. Study on Quality Evaluation Method of Speech Datasets for Algorithm Model[J]. Computer Science, 2022, 49(11A): 210800246-6. https://doi.org/10.11896/jsjkx.210800246
