This paper aims to synthesize the target speaker's speech with a desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. Specifically, we address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt a multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer fine-grained expressiveness from the references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of the proposed model design.
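To illustrate the MBV discretization mentioned above, the following is a minimal sketch assuming a straight-through binarization layer in PyTorch. The class name `MultiLabelBinaryVector`, the dimensions, and the thresholding scheme are illustrative assumptions and may differ from the paper's actual formulation.

```python
import torch
import torch.nn as nn


class MultiLabelBinaryVector(nn.Module):
    """Hypothetical sketch of an MBV-style bottleneck: a continuous
    embedding is projected to logits and binarized into a multi-label
    binary code, with a straight-through estimator so gradients can
    still flow back to the encoder."""

    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.proj(x))   # soft activations in (0, 1)
        hard = (probs > 0.5).float()          # discrete 0/1 code
        # Straight-through: forward pass uses the hard code,
        # backward pass uses the gradient of the soft probabilities.
        return hard + probs - probs.detach()


# Usage: discretize a batch of continuous style embeddings into 32-bit binary codes.
style_emb = torch.randn(4, 256)
mbv = MultiLabelBinaryVector(in_dim=256, code_dim=32)
style_code = mbv(style_emb)                   # entries are exactly 0 or 1
```

Binarizing the style and emotion embeddings in this manner restricts their capacity, which, together with MI minimization, discourages speaker timbre from leaking into the style and emotion representations.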