This paper aims to synthesize the target speaker's speech with the desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further address the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt a multi-label binary vector (MBV) to discretize the extracted embeddings and mutual information (MI) minimization to disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of our model design.
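To make the MBV idea concrete, the sketch below (PyTorch, with hypothetical module and parameter names; the paper's exact implementation may differ) shows one common way to discretize a continuous style/emotion embedding into a multi-label binary code: squash the projection to (0, 1) and binarize it with a straight-through estimator so gradients still flow to the encoder.

```python
# Minimal sketch of a multi-label binary vector (MBV) bottleneck.
# Assumptions: in_dim, code_dim, and the straight-through binarization
# are illustrative choices, not the paper's confirmed implementation.
import torch
import torch.nn as nn


class MultiLabelBinaryVector(nn.Module):
    """Project an embedding to a binary code; hard values in the forward
    pass, soft sigmoid gradients in the backward pass (straight-through)."""

    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.proj(x))   # soft activations in (0, 1)
        hard = (soft > 0.5).float()          # discrete 0/1 code
        return hard + soft - soft.detach()   # straight-through estimator


# Usage: compress a 256-dim style embedding into a 32-bit binary code.
mbv = MultiLabelBinaryVector(in_dim=256, code_dim=32)
style_code = mbv(torch.randn(4, 256))        # shape (4, 32), values in {0, 1}
```

Restricting the bottleneck to a discrete binary code limits how much speaker-specific information can leak through the style and emotion branches, which is complementary to the MI-minimization objective used for disentanglement.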