The potential of synthetic data for training text-to-speech (TTS) models has attracted increasing attention, yet its soundness and effectiveness have not been systematically validated. In this study, we investigate the feasibility of training TTS models on purely synthetic data and examine how several factors (text richness, speaker diversity, noise level, and speaking style) affect model performance. Our experiments show that increasing speaker and text diversity significantly improves synthesis quality and robustness, that cleaner training data with minimal noise further improves performance, and that standard speaking styles facilitate more effective model learning. The results also indicate that models trained on synthetic data have the potential to outperform those trained on real data under similar conditions, which we attribute to the absence of real-world imperfections and noise.