Research on Voice Cloning System Based on XTTS Model

Computer Science, 2026, Vol. 53, Issue (5): 59-67. doi: 10.11896/jsjkx.250600187

• Intelligent Education Technology •

WANG Chencai¹, YANG Siyan², MIAO Qiguang¹,³

1 School of Computer Science and Technology, Xidian University, Xi’an 710071, China
2 Department of Information Technology, The Open University of Shaanxi, Xi’an 710119, China
3 Xi’an Key Laboratory of Big Data and Intelligent Vision, Xi’an 710071, China

Received: 2025-06-26; Revised: 2025-07-20; Published: 2026-05-08
About author: WANG Chencai, born in 2003, postgraduate. His main research interests include intelligent educational technology.
MIAO Qiguang, born in 1972, Ph.D., professor, is a councillor of CCF (No. 09025D). His main research interests include computer vision, big data analysis, and intelligent educational technology.
Supported by: Key Project of Shaanxi Polytechnic Institute Research Program (20GA06), Guangxi Key Laboratory of Trusted Software Project (KX202047), Key Research and Development Program of Shaanxi Province (2024GH-ZDXM-47), Higher Education Teaching Reform Research Program of Shaanxi Province (23JG003), and Research Project of the China Association of Higher Education (24PG0101).

Abstract

With the continuous advancement of deep learning and speech synthesis technologies, voice cloning has shown broad application prospects in intelligent voice assistants, virtual anchors, and barrier-free communication. However, existing voice cloning systems still face challenges in timbre similarity, interaction efficiency, and large-scale processing capability, which makes it difficult to meet the growing demand for high-quality, personalized speech synthesis. To address these limitations, this paper designs and implements a Web-based platform for multilingual voice cloning and batch text-to-speech synthesis based on the XTTS model. The system improves on existing solutions by broadening language coverage, reducing the data required for timbre transfer, and optimizing batch processing efficiency. It adopts a decoupled front-end/back-end architecture, with a Flask-based RESTful API at the back end and mainstream Web technologies combined with AJAX at the front end. MySQL is used to manage user and audio data. The platform integrates voice cloning, text-to-speech, and batch synthesis modules, and demonstrates strong flexibility and scalability. Experimental results show that the system performs well in speech naturalness and timbre similarity, indicating its practical value and application potential.

Key words: Voice cloning, Text-to-speech, XTTS, FreeVC, Flask
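The batch synthesis module mentioned in the abstract can be illustrated with a minimal sketch. This page does not describe the authors' implementation, so everything below is an assumption for illustration: the names `BatchItem`, `split_batch`, and `run_batch` are hypothetical, each non-empty input line is treated as one utterance, and the synthesis back end is passed in as a callable. In a real deployment that callable might wrap an XTTS inference call (for example Coqui TTS's `tts_to_file`) and be invoked from a Flask route; neither detail is confirmed by the abstract.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List

# Hypothetical back-end signature: (text, speaker_wav, language, out_path) -> None.
# A real implementation might wrap an XTTS model here (assumption).
Synthesizer = Callable[[str, str, str, Path], None]


@dataclass
class BatchItem:
    index: int        # 1-based position of the utterance in the batch
    text: str         # text that was sent to the synthesizer
    out_path: Path    # where the audio file was (or would have been) written
    ok: bool          # True if synthesis finished without raising
    error: str = ""   # error message when ok is False


def split_batch(raw_text: str) -> List[str]:
    """Treat each non-empty line as one utterance (an assumed convention)."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def run_batch(
    raw_text: str,
    speaker_wav: str,
    language: str,
    out_dir: str,
    synthesize: Synthesizer,
) -> List[BatchItem]:
    """Synthesize every utterance with the same reference voice.

    A failure on one utterance is recorded and does not abort the batch.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    results: List[BatchItem] = []
    for i, text in enumerate(split_batch(raw_text), start=1):
        out_path = target / f"utt_{i:04d}.wav"
        try:
            synthesize(text, speaker_wav, language, out_path)
            results.append(BatchItem(i, text, out_path, ok=True))
        except Exception as exc:  # keep going; report the failure per item
            results.append(BatchItem(i, text, out_path, ok=False, error=str(exc)))
    return results
```

Passing the synthesizer as a parameter keeps the batching logic independent of the model, so it can be exercised with a stub function before a GPU-backed XTTS model is attached. The per-item result list is one way a REST endpoint could report partial failures back to the front end.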

CLC Number: TP311

WANG Chencai, YANG Siyan, MIAO Qiguang. Research on Voice Cloning System Based on XTTS Model [J]. Computer Science, 2026, 53(5): 59-67.

