Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any additional adaptation time or parameters. Previous work usually uses a speaker encoder to extract a fixed global speaker embedding from reference speech, and several attempts have explored variable-length speaker embeddings. However, these approaches neglect to transfer the personal pronunciation characteristics tied to phoneme content, leading to poor speaker similarity in terms of detailed speaking styles and pronunciation habits. To improve the ability of the speaker encoder to model personal pronunciation characteristics, we propose a content-dependent fine-grained speaker embedding for zero-shot speaker adaptation. Corresponding local content embeddings and local speaker embeddings are extracted from the reference speech. Instead of modeling temporal relations, a reference attention module is introduced to model the content relevance between the reference speech and the input text, and to generate a fine-grained speaker embedding for each phoneme encoder output. Experimental results show that the proposed method improves the speaker similarity of synthesized speech, especially for unseen speakers.
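To make the reference attention idea concrete, the following is a minimal sketch rather than the actual implementation. It assumes the module can be approximated by single-head scaled dot-product attention, with the phoneme encoder outputs as queries, the local content embeddings as keys, and the local speaker embeddings as values. The function name `reference_attention`, the tensor shapes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def reference_attention(phoneme_out, content_emb, speaker_emb):
    """Hypothetical sketch of content-based reference attention.

    phoneme_out: (T_text, d)    phoneme encoder outputs, used as queries
    content_emb: (T_ref, d)     local content embeddings of the reference, used as keys
    speaker_emb: (T_ref, d_spk) local speaker embeddings of the reference, used as values

    Returns one speaker embedding per phoneme, shape (T_text, d_spk),
    together with the attention weights, shape (T_text, T_ref).
    """
    d = phoneme_out.shape[-1]
    # Content relevance between each input phoneme and each reference segment.
    # Scaled dot-product scoring is an assumption, not stated in the abstract.
    scores = phoneme_out @ content_emb.T / np.sqrt(d)
    # Numerically stable softmax over the reference axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each phoneme receives a relevance-weighted mix of local speaker embeddings.
    return weights @ speaker_emb, weights


# Toy inputs: 5 input phonemes, 12 reference segments (random placeholders).
rng = np.random.default_rng(0)
phoneme_out = rng.standard_normal((5, 8))
content_emb = rng.standard_normal((12, 8))
speaker_emb = rng.standard_normal((12, 4))

fine_grained, weights = reference_attention(phoneme_out, content_emb, speaker_emb)
print(fine_grained.shape, weights.shape)
```

Because the weights depend only on content similarity and not on position, the input text and the reference speech need not share length or ordering, which is in keeping with modeling content relevance instead of temporal relations.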