In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations from speech, which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.
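To make the two-stage recipe concrete, below is a minimal PyTorch-style sketch, not the paper's actual architecture: the module names `WordProsodyEncoder` and `ProsodyPredictor`, the mean-pooling over word spans, and all dimensions are hypothetical illustrations of learning word-level prosody targets in Stage I and regressing them from contextual text embeddings in Stage II.

```python
import torch
import torch.nn as nn

class WordProsodyEncoder(nn.Module):
    """Stage I (hypothetical simplification): encode mel frames into one
    word-level prosody vector per word via pooling over word spans."""
    def __init__(self, n_mels=80, dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels, word_spans):
        # mels: (B, T, n_mels); word_spans: list of (start, end) frame indices
        h, _ = self.rnn(mels)
        # average hidden states within each word span -> (B, n_words, dim)
        return torch.stack([h[:, s:e].mean(dim=1) for s, e in word_spans], dim=1)

class ProsodyPredictor(nn.Module):
    """Stage II (hypothetical simplification): predict the Stage-I word-level
    prosody vectors from contextual text embeddings (e.g., from a language model)."""
    def __init__(self, text_dim=768, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, word_text_embeddings):
        # word_text_embeddings: (B, n_words, text_dim) -> (B, n_words, dim)
        return self.proj(word_text_embeddings)

# Stage II objective: regress predicted vectors onto frozen Stage-I targets.
encoder, predictor = WordProsodyEncoder(), ProsodyPredictor()
mels = torch.randn(1, 120, 80)          # toy utterance: 120 mel frames
spans = [(0, 40), (40, 90), (90, 120)]  # toy word boundaries: 3 words
text_emb = torch.randn(1, 3, 768)       # toy contextual word embeddings
with torch.no_grad():
    targets = encoder(mels, spans)      # Stage I representations, held fixed
loss = nn.functional.mse_loss(predictor(text_emb), targets)
loss.backward()
```

At inference time, such a predictor would let the TTS front end supply prosody representations from text alone, while the Stage-I encoder path would remain available for transferring prosody from a reference utterance.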