Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high-fidelity audio through a combination of adversarial feedback and prediction losses that constrain the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5-point scale, comparable to state-of-the-art models that rely on multi-stage training and additional supervision.
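To make the "differentiable alignment scheme based on token length prediction" concrete, here is a minimal NumPy sketch of one plausible realisation: predicted per-token lengths are cumulatively summed to place each token's centre on the output timeline, and each output frame attends to tokens via a softmax over Gaussian distances to those centres, so gradients flow back into the length predictions. The function name `align`, the fixed `sigma`, and the squared-distance weighting are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def align(token_feats, lengths, out_steps, sigma=10.0):
    """Differentiable monotonic alignment from predicted token lengths.

    token_feats: (n, d) per-token features.
    lengths: (n,) predicted token durations in output frames.
    Returns (out_steps, d) time-aligned features.
    """
    ends = np.cumsum(lengths)            # cumulative end position of each token
    centres = ends - lengths / 2.0       # centre of each token's span
    t = np.arange(out_steps)[:, None]    # (T, 1) output timesteps
    # Gaussian log-weights: output frames near a token's centre attend to it.
    logits = -((t - centres[None, :]) ** 2) / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)  # row-wise softmax over tokens
    return w @ token_feats                # (T, d) aligned feature sequence

# Usage (hypothetical values): three tokens of 4-dim features,
# predicted to span 3, 5, and 2 output frames respectively.
h = np.random.randn(3, 4)
aligned = align(h, np.array([3.0, 5.0, 2.0]), out_steps=10)
```

Because every step (cumsum, softmax, matrix product) is smooth, the total predicted duration and the placement of each token receive gradient from any loss applied to the aligned output, which is what makes the feed-forward generator trainable end to end.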
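The soft dynamic time warping used in the spectrogram-based prediction loss can likewise be illustrated with a short sketch. This is the standard soft-DTW recursion (Cuturi & Blondel, 2017): the hard minimum in classic DTW is replaced by a smooth log-sum-exp minimum with temperature `gamma`, making the discrepancy differentiable so small timing offsets between generated and ground-truth spectrograms are not heavily penalised. The quadratic-time loop, the squared-Euclidean frame cost, and the default `gamma` are assumptions for clarity, not the paper's exact configuration.

```python
import numpy as np

def soft_min(values, gamma):
    # Smooth minimum: -gamma * log(sum(exp(-v / gamma))),
    # computed stably by shifting by the largest term.
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW discrepancy between two spectrogram-like sequences.

    x: (n, d) predicted frames, y: (m, d) target frames.
    Smaller gamma approaches classic (hard) DTW.
    """
    n, m = len(x), len(y)
    # Pairwise frame distances (here: squared Euclidean).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each cell soft-mins over the three DTW predecessors.
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]

# Usage (hypothetical): compare a generated and a target mel-spectrogram.
pred = np.random.randn(50, 80)
target = np.random.randn(55, 80)
loss = soft_dtw(pred, target)
```

Using this discrepancy in place of a frame-by-frame L1/L2 spectrogram loss lets the generator vary local timing (as natural speech does) while still being anchored to the target's overall content and duration.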