Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited: they do not allow the model to exploit the full potential of a data-driven approach by learning hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves this bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise, which allows us to train the first-stage module on annotated speech datasets of lower quality. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, because Wav2Vec 2.0 embeddings are already time-aligned. This results in improved generalization both to out-of-vocabulary words and to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also exhibits useful properties that enable tasks such as voice conversion and zero-shot synthesis.
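To illustrate the intermediate representation the two stages share, below is a minimal sketch (not the authors' code) of extracting frame-level Wav2Vec 2.0 embeddings with the Hugging Face transformers library; the checkpoint name and the choice of the final hidden layer are assumptions for illustration, and the paper's exact model and layer may differ.

```python
# Sketch: extract time-aligned Wav2Vec 2.0 hidden activations, which WavThruVec
# uses as the intermediate representation between its two stages.
# Assumptions: Hugging Face `transformers` implementation, base 960h checkpoint,
# last hidden layer as the embedding. These are illustrative choices only.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One second of placeholder audio at 16 kHz; in practice, load a real waveform.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings, time-aligned with the audio (~50 frames per second),
# shape (batch, frames, 768) for the base model. The first stage would be
# trained to predict such frames from text; the second stage would vocode
# them back to a waveform, which requires no transcriptions.
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```

Because these frames come with an implicit time alignment, a second-stage vocoder can be trained on untranscribed audio alone, which is the property the abstract highlights.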