This paper presents a novel data augmentation technique for text-to-speech (TTS) that generates new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactic correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. Perceptual evaluations show that our method improves speech quality across a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves the robustness of attention-based TTS models.
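To make the fragment-substitution idea concrete, here is a minimal sketch. It assumes word-level forced alignments (audio sample spans per fragment) and POS tags are available; the `Fragment`/`Utterance` structures and the POS-signature matching rule are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of syntactically-constrained fragment substitution.
# Assumes word-level forced alignments and POS tags are available; the
# Fragment/Utterance structures and the POS-signature matching rule are
# illustrative assumptions, not the paper's code.
from dataclasses import dataclass

import numpy as np


@dataclass
class Fragment:
    text: str              # words covered by this fragment
    pos: tuple[str, ...]   # POS-tag sequence: a crude syntactic signature
    audio: np.ndarray      # aligned audio samples (mono, shared sample rate)


@dataclass
class Utterance:
    fragments: list[Fragment]

    @property
    def text(self) -> str:
        return " ".join(f.text for f in self.fragments)

    @property
    def audio(self) -> np.ndarray:
        return np.concatenate([f.audio for f in self.fragments])


def substitute(utt: Utterance, donor: Utterance) -> Utterance | None:
    """Swap one fragment of `utt` for a donor fragment with the same POS
    signature, yielding a new, syntactically well-formed (text, audio) pair."""
    for i, frag in enumerate(utt.fragments):
        for cand in donor.fragments:
            if cand.pos == frag.pos and cand.text != frag.text:
                new_fragments = list(utt.fragments)
                new_fragments[i] = cand
                return Utterance(new_fragments)
    return None  # no syntactically compatible donor fragment found
```

In a real pipeline the swap would also be gated on acoustic compatibility (e.g., same speaker, matching boundary prosody and loudness), in the spirit of the additional measures the paper takes against concatenation artifacts.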