Previous works on expressive text-to-speech (TTS) are limited in robustness and speed during both training and inference. These drawbacks mostly stem from autoregressive decoding, which makes each succeeding step vulnerable to errors in the preceding steps. To overcome this weakness, we propose STYLER, a novel expressive text-to-speech model with a parallelized architecture. Removing autoregressive decoding and introducing speech decomposition for encoding makes speech synthesis more robust while preserving high style transfer performance. Moreover, our novel approach to modeling noise from audio, using domain adversarial training and Residual Decoding, enables style transfer without transferring noise. Our experiments demonstrate the naturalness and expressiveness of our model in comparison with other parallel TTS models. We further investigate our model's robustness and speed by comparing it with an expressive TTS model that uses autoregressive decoding.
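Domain adversarial training, as referenced above, is typically realized with a gradient reversal layer between a shared encoder and an auxiliary domain (here, noise) classifier. Below is a minimal PyTorch sketch of such a layer; the names `GradReverse` and `grad_reverse` are illustrative assumptions, not identifiers from the STYLER codebase.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients
    in the backward pass (standard gradient reversal layer)."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient sign so the upstream encoder is pushed
        # toward features the noise classifier cannot exploit.
        return grad_output.neg() * ctx.lamb, None


def grad_reverse(x, lamb=1.0):
    # Hypothetical helper: apply gradient reversal with strength `lamb`.
    return GradReverse.apply(x, lamb)
```

In this setup, encoder features pass through `grad_reverse` before the noise classifier; the classifier minimizes its loss as usual, while the reversed gradients adversarially remove noise information from the style encoding.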