Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech with a target speaker's timbre. In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style, a symptom of the one-to-many problem (i.e., multiple prosody variations correspond to the same text). To address this problem, a strength-controlled semi-supervised style extractor is proposed to disentangle style from content and timbre, improving the representation and interpretability of the global style embedding, which alleviates the one-to-many mapping and data imbalance problems in prosody prediction. A hierarchical prosody predictor is proposed to improve prosody modeling. We find that better style transfer can be achieved by using the source speaker's prosody features that are easily predicted. Additionally, a speaker-transfer-wise cycle consistency loss is proposed to help the model learn unseen style-timbre combinations during training. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
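To make the cycle consistency idea concrete, below is a minimal PyTorch sketch of one plausible form of a speaker-transfer-wise cycle consistency loss: the style embedding is extracted from the source speaker's mel-spectrogram, speech is synthesized with that style but the target speaker's timbre (a style-timbre combination unseen in training), and the style re-extracted from the synthesized mel is pulled back toward the source style. The abstract does not specify the exact formulation; the module interfaces (`style_extractor`, `synthesizer`) and the L1 distance are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def speaker_transfer_cycle_loss(style_extractor, synthesizer,
                                mel_source, text, target_speaker_id):
    """Illustrative speaker-transfer-wise cycle consistency loss.

    Hypothetical interfaces (not from the paper):
      style_extractor(mel) -> (B, D) global style embedding
      synthesizer(text, style, speaker_id) -> (B, T, n_mels) mel-spectrogram
    """
    # 1. Extract a global style embedding from the source speaker's speech.
    style_src = style_extractor(mel_source)

    # 2. Synthesize the same text with the source style but the *target*
    #    speaker's timbre -- a style-timbre pairing unseen during training.
    mel_transfer = synthesizer(text, style_src, speaker_id=target_speaker_id)

    # 3. Re-extract the style from the transferred speech; the cycle term
    #    requires it to match the original source style embedding.
    style_cycle = style_extractor(mel_transfer)
    return F.l1_loss(style_cycle, style_src.detach())
```

Detaching the source embedding is one common design choice here: it keeps the cycle term from simply collapsing the extractor's output space, so the gradient instead pushes the synthesizer to preserve style under timbre transfer.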