Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet offer the same fine-grained adjustability as pitch-conditioned deterministic models such as FastPitch and FastSpeech2. Pitch information is not only low-dimensional but also discontinuous, making it particularly difficult to model in a generative setting. Our work explores several techniques for handling these issues in the context of Normalizing Flow models. We also find this problem to be very well suited to Neural Spline Flows, which are a highly expressive alternative to the more common affine-coupling mechanism in Normalizing Flows.