Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate speech with the voice characteristics of an unseen speaker. The main challenge in ZSM-TTS is improving speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models uses functions that predict the scale and bias parameters of the affine coupling layers from a given speaker embedding vector. In this letter, we improve on this conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer, which enables speech synthesis for unseen speakers in a zero-shot manner through a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes its input during training using parameters predicted from a speaker embedding vector, so that at inference the inverse process can denormalize with the embedding of a new speaker. The proposed conditioning scheme yields state-of-the-art speech quality and speaker similarity in a ZSM-TTS setting.
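To make the normalize-then-denormalize idea concrete, the following is a minimal NumPy sketch of one plausible arrangement of such a coupling layer. It is not the authors' exact formulation: the function names (`speaker_normalize`, `speaker_denormalize`, `snac_forward`, `snac_inverse`), the linear predictors, the dimensions, and the random weights are all hypothetical and chosen only for illustration. The forward (training) direction removes speaker statistics predicted from the embedding `g`, and the inverse (inference) direction reapplies statistics predicted from a possibly different embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, SPK_DIM = 4, 3          # hypothetical feature and speaker-embedding sizes
HALF = DIM // 2

# Hypothetical linear predictors standing in for learned networks.
W_mean = 0.1 * rng.normal(size=(SPK_DIM, HALF))
W_logstd = 0.1 * rng.normal(size=(SPK_DIM, HALF))
W_scale = 0.1 * rng.normal(size=(HALF, HALF))
W_bias = 0.1 * rng.normal(size=(HALF, HALF))


def speaker_normalize(x, g):
    """Remove the mean and scale predicted from speaker embedding g."""
    mean, logstd = g @ W_mean, g @ W_logstd
    return (x - mean) * np.exp(-logstd)


def speaker_denormalize(x, g):
    """Exact inverse of speaker_normalize for the same g."""
    mean, logstd = g @ W_mean, g @ W_logstd
    return x * np.exp(logstd) + mean


def coupling_params(h):
    """Scale and bias of the affine coupling, computed from normalized input."""
    return np.tanh(h @ W_scale), h @ W_bias


def snac_forward(x, g):
    """Training direction: normalize by speaker statistics, then couple."""
    x_a, x_b = x[:, :HALF], x[:, HALF:]
    s, b = coupling_params(speaker_normalize(x_a, g))
    y_b = speaker_normalize(x_b, g) * np.exp(s) + b
    return np.concatenate([x_a, y_b], axis=1)


def snac_inverse(y, g):
    """Inference direction: undo the coupling, then denormalize for speaker g."""
    y_a, y_b = y[:, :HALF], y[:, HALF:]
    s, b = coupling_params(speaker_normalize(y_a, g))
    x_b = speaker_denormalize((y_b - b) * np.exp(-s), g)
    return np.concatenate([y_a, x_b], axis=1)


x = rng.normal(size=(5, DIM))        # 5 frames of hypothetical features
g_train = rng.normal(size=SPK_DIM)   # embedding of a training speaker
g_new = rng.normal(size=SPK_DIM)     # embedding of an unseen speaker

z = snac_forward(x, g_train)
x_same = snac_inverse(z, g_train)    # reconstructs x
x_new = snac_inverse(z, g_new)       # same content, different speaker statistics
```

With the same embedding in both directions the layer is exactly invertible. Passing a different embedding to the inverse changes only the speaker-dependent statistics, which is the property the zero-shot setting relies on in this sketch.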