Computer Science, 2026, Vol. 53, Issue (5): 59-67. doi: 10.11896/jsjkx.250600187

• Intelligent Educational Technology •

Research on Voice Cloning System Based on XTTS Model

WANG Chencai1, YANG Siyan2, MIAO Qiguang1,3

  1 School of Computer Science and Technology, Xidian University, Xi'an 710071, China
  2 Department of Information Technology, The Open University of Shaanxi, Xi'an 710119, China
  3 Xi'an Key Laboratory of Big Data and Intelligent Vision, Xi'an 710071, China
  • Received: 2025-06-26  Revised: 2025-07-20  Online: 2026-05-08
  • Corresponding author: MIAO Qiguang (qgmiao@xidian.edu.cn)
  • About author: WANG Chencai (25031212025@stu.xidian.edu.cn), born in 2003, postgraduate. His main research interests include intelligent educational technology.
    MIAO Qiguang, born in 1972, Ph.D., professor, is a councillor of CCF (No. 09025D). His main research interests include computer vision, big data analysis and intelligent educational technology.
  • Supported by:
    Key Project of Shaanxi Polytechnic Institute Research Program (20GA06), Guangxi Key Laboratory of Trusted Software Project (KX202047), Key Research and Development Program of Shaanxi Province (2024GH-ZDXM-47), Higher Education Teaching Reform Research Program of Shaanxi Province (23JG003) and Research Project of the China Association of Higher Education (24PG0101).

Abstract: With the continuing advances in deep learning and speech synthesis, voice cloning shows broad application prospects in intelligent voice assistants, virtual anchors, and accessible communication. However, existing voice cloning systems still fall short in timbre similarity, ease of interaction, and large-scale data processing, making it difficult to meet users' practical demand for high-quality, personalized speech synthesis. To address these limitations, this paper designs and implements a Web platform, based on the XTTS model, that supports multilingual voice cloning and batch text-to-speech synthesis. The work targets three problems: limited language coverage, restricted timbre transfer under low-resource conditions, and inefficient batch processing. The system adopts a decoupled front-end/back-end architecture: the back end exposes a Flask-based RESTful API, the front end combines mainstream Web technologies with AJAX for asynchronous interaction, and MySQL manages user and audio data. The platform integrates voice cloning, text-to-speech, and batch processing modules and is designed for flexibility and extensibility. Test results show that the system performs well in speech naturalness and timbre similarity, indicating practical value and application potential.

Key words: Voice cloning, Text-to-speech, XTTS, FreeVC, Flask
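
The batch text-to-speech module mentioned in the abstract could be organized roughly as in the sketch below. This is not the authors' implementation: the names `synthesize_batch`, `BatchItemResult`, and `fake_synthesize` are hypothetical, and the synthesizer is a stand-in callable rather than a real XTTS call. The sketch only illustrates one plausible design choice, namely that a failure on one text should not abort the rest of the batch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical signature for a synthesizer backend:
# (text, language, speaker_wav_path) -> raw audio bytes.
# In a real system this would wrap the XTTS model; here it is an assumption.
Synthesizer = Callable[[str, str, str], bytes]


@dataclass
class BatchItemResult:
    """Outcome of synthesizing one text in a batch."""
    index: int
    text: str
    audio: Optional[bytes] = None
    error: Optional[str] = None


def synthesize_batch(
    texts: List[str],
    language: str,
    speaker_wav: str,
    synthesize: Synthesizer,
) -> List[BatchItemResult]:
    """Synthesize each text in order, recording per-item errors
    instead of stopping the whole batch on the first failure."""
    results: List[BatchItemResult] = []
    for index, text in enumerate(texts):
        cleaned = text.strip()
        if not cleaned:
            # Skip blank entries but keep their position in the results.
            results.append(BatchItemResult(index, text, error="empty text"))
            continue
        try:
            audio = synthesize(cleaned, language, speaker_wav)
            results.append(BatchItemResult(index, cleaned, audio=audio))
        except Exception as exc:
            results.append(BatchItemResult(index, cleaned, error=str(exc)))
    return results


def fake_synthesize(text: str, language: str, speaker_wav: str) -> bytes:
    """Stand-in for a real model call; returns tagged bytes, not audio."""
    return f"{language}:{speaker_wav}:{text}".encode("utf-8")


if __name__ == "__main__":
    batch = synthesize_batch(["hello", "  ", "world"], "en", "ref.wav", fake_synthesize)
    for item in batch:
        status = "ok" if item.error is None else f"failed ({item.error})"
        print(item.index, status)
```

In a Flask back end such as the one the abstract describes, a function of this shape would typically be called from a request handler, with the per-item results returned to the front end as JSON so that the AJAX client can report which texts succeeded.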

CLC Number: TP311