Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 210800246-6. doi: 10.11896/jsjkx.210800246
李荪, 曹峰, 刘姿杉
LI Sun, CAO Feng, LIU Zi-shan
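Abstract: As intelligent speech technologies and products mature and are deployed at scale, demand for high-quality speech datasets keeps growing. Quality assessment methods for structured data have received some study, but no quality assessment standard yet exists for unstructured speech datasets. By studying how speech algorithm models are built and analyzing the requirements for constructing speech datasets, this paper establishes a unified quality assessment system for speech datasets. The system evaluates speech datasets intended for algorithm model training along four dimensions: breadth of coverage, selection distinctiveness, domain depth, and data completeness. Concrete assessment indicators, calculation methods, and evaluation steps are proposed and applied to a speech dataset from the in-vehicle application domain; the results are analyzed to provide a reference for assessing speech dataset quality and guiding dataset construction. Finally, taking into account diverse applicability, privacy, efficiency requirements, and the need for automation in dataset construction, the paper offers suggestions for the future development of high-quality speech datasets.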