One-shot style transfer is a challenging task: training on a single utterance makes the model prone to over-fitting, which leads to low speaker similarity and a lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion (VC) approach for style transfer based on speaker adaptation. First, a speaker normalization module removes speaker-related information from the bottleneck features extracted by the ASR model. Second, we apply weight regularization during adaptation to prevent over-fitting caused by using only one utterance from the target speaker as training data. Finally, to fully decouple the speech factors, i.e., content, speaker, and style, and to transfer the source style to the target, a prosody module extracts a prosody representation. Experiments show that our approach outperforms state-of-the-art one-shot VC systems in both style and speaker similarity while maintaining good speech quality.
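The abstract does not specify the exact form of the weight regularization; a common instantiation for speaker adaptation is an L2 penalty that pulls the adapted weights toward the pre-trained multi-speaker weights (sometimes called L2-SP). A minimal PyTorch sketch under that assumption, where `reg_weight` and the helper names are hypothetical:

```python
import torch
import torch.nn as nn


def snapshot_params(model: nn.Module) -> dict:
    """Copy the pre-adaptation (multi-speaker) weights before fine-tuning."""
    return {n: p.detach().clone() for n, p in model.named_parameters()}


def adaptation_loss(model: nn.Module,
                    pretrained: dict,
                    base_loss: torch.Tensor,
                    reg_weight: float = 1e-3) -> torch.Tensor:
    """Weight-regularized adaptation loss (assumed L2-SP form).

    Penalizing deviation from the pre-trained weights discourages
    over-fitting when only one target-speaker utterance is available.
    """
    reg = torch.zeros((), device=base_loss.device)
    for name, p in model.named_parameters():
        if p.requires_grad:
            reg = reg + torch.sum((p - pretrained[name]) ** 2)
    return base_loss + reg_weight * reg


# Usage sketch: snapshot weights, then add the penalty each step.
# model = ...  # pre-trained synthesis model
# pretrained = snapshot_params(model)
# loss = adaptation_loss(model, pretrained, reconstruction_loss)
# loss.backward()
```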