The main objective when training a neural speech synthesis system is usually to synthesize natural and expressive speech from the output layer of the neural network, with little attention given to the hidden layers. However, by learning a useful latent representation, the system can be applied to many more practical scenarios. In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with its continuous counterpart. By enforcing different policies over the latent space during training, we obtain latent linguistic embeddings that take on different properties while achieving similar performance in terms of quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations, but has a discrete latent space that is useful for reducing the representation bit-rate, which is desirable for data transfer, or for limiting information leakage, which is important for speaker anonymization and similar tasks.
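To make the link between a discrete latent space and bit-rate concrete, the following is a minimal sketch of nearest-neighbour vector quantization. It is not the system described in this paper: the codebook size (256), latent dimension (64), frame rate, the random data, and the helper names `quantize` and `bits_per_second` are all assumptions chosen for illustration.

```python
import math
import numpy as np

def quantize(latents, codebook):
    """Map each latent frame to its nearest codebook vector (Euclidean distance)."""
    # Squared distances between every frame and every codebook entry: shape (T, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)          # one integer index per frame
    return indices, codebook[indices]       # indices and their quantized vectors

def bits_per_second(codebook_size, frames_per_second):
    """Bit-rate when one fixed-length codebook index is sent per frame."""
    return math.ceil(math.log2(codebook_size)) * frames_per_second

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 codes of dimension 64
latents = rng.normal(size=(100, 64))    # 100 frames of continuous latents

indices, quantized = quantize(latents, codebook)
rate = bits_per_second(codebook_size=256, frames_per_second=100)
```

Under these assumed sizes, each frame is represented by a single 8-bit index instead of 64 floating-point values, which is the sense in which a discrete latent space lowers the representation bit-rate. A trained system would learn the codebook jointly with the network rather than sample it at random.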