Computer Science, 2026, Vol. 53, Issue (5): 59-67. doi: 10.11896/jsjkx.250600187
王陈偲1, 杨思燕2, 苗启广1,3
WANG Chencai1, YANG Siyan2, MIAO Qiguang1,3
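Abstract: With the continued development of deep learning and speech synthesis, voice cloning shows broad application prospects in intelligent voice assistants, virtual anchors, and accessible communication. However, existing voice cloning systems still fall short in timbre similarity, ease of interaction, and large-scale data processing, and struggle to meet users' practical needs for high-quality, personalized speech synthesis. To address this, a Web platform supporting multilingual voice cloning and batch text-to-speech is designed and implemented on top of the XTTS model, with improvements targeting limited language coverage, restricted timbre transfer under low-resource conditions, and inefficient batch processing. The system adopts a decoupled front-end/back-end architecture: the back end exposes API endpoints built with Flask, the front end combines mainstream Web technologies with AJAX for asynchronous interaction, and MySQL manages user and audio data. The platform integrates voice cloning, text-to-speech, and batch processing modules, and is flexible and extensible. Test results indicate that the system performs well in speech naturalness and timbre similarity, and has high application value and potential for wider adoption.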
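The batch text-to-speech flow described in the abstract can be illustrated with a minimal, framework-agnostic sketch. It is not the authors' implementation: the names `run_batch`, `handle_batch_request`, and `fake_synthesize`, the payload fields `texts`, `language`, and `speaker_wav`, and the use of a thread pool are all assumptions made for illustration. `fake_synthesize` is a placeholder that returns dummy bytes where a real system would call the XTTS model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signature: (text, language, speaker_wav_path) -> raw audio bytes.
Synthesizer = Callable[[str, str, str], bytes]


@dataclass
class BatchItem:
    index: int
    text: str
    ok: bool
    audio: bytes = b""
    error: str = ""


def fake_synthesize(text: str, language: str, speaker_wav: str) -> bytes:
    """Placeholder standing in for a real XTTS call; returns dummy bytes."""
    if not text.strip():
        raise ValueError("empty text")
    return f"{language}|{speaker_wav}|{text}".encode("utf-8")


def run_batch(
    texts: List[str],
    language: str,
    speaker_wav: str,
    synthesize: Synthesizer = fake_synthesize,
    max_workers: int = 2,
) -> List[BatchItem]:
    """Synthesize every text; one failing item does not abort the batch."""

    def one(pair):
        index, text = pair
        try:
            audio = synthesize(text, language, speaker_wav)
            return BatchItem(index, text, True, audio=audio)
        except Exception as exc:  # record the failure and keep going
            return BatchItem(index, text, False, error=str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order.
        return list(pool.map(one, enumerate(texts)))


def handle_batch_request(payload: Dict) -> Dict:
    """Validate a JSON-like payload and return a JSON-like response.

    A Flask route could pass request.get_json() here (an assumption,
    not something the abstract specifies).
    """
    texts = payload.get("texts")
    if not isinstance(texts, list) or not texts:
        return {"status": 400, "error": "texts must be a non-empty list"}
    results = run_batch(
        texts,
        payload.get("language", "en"),
        payload.get("speaker_wav", ""),
    )
    return {
        "status": 200,
        "items": [
            {"index": r.index, "ok": r.ok, "bytes": len(r.audio), "error": r.error}
            for r in results
        ],
    }
```

Keeping the handler independent of the web framework means the batch logic can be exercised without a running server, and per-item error capture lets the front end report which texts failed instead of rejecting the whole batch.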