Voice style transfer, also called voice conversion, seeks to modify one speaker's voice so that it sounds as if it were produced by another (target) speaker. Prior work has made progress on voice conversion with parallel training data and speakers known in advance. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice style transfer method based on disentangled representation learning. The method first encodes the speaker-related style and the linguistic content of each input utterance into separate low-dimensional embedding spaces, and then synthesizes a new voice by combining the source content embedding with the target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On the real-world VCTK dataset, our method outperforms other baselines and achieves state-of-the-art transfer accuracy and voice naturalness under both many-to-many and zero-shot setups.
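The encode-then-recombine pipeline described above (a style encoder, a content encoder, and a decoder that merges source content with target style) can be sketched roughly as follows. All module names, layer choices, and dimensions are illustrative assumptions, not the authors' architecture, and the information-theoretic guidance (e.g., a penalty discouraging dependence between the two embeddings) is omitted for brevity.

```python
# Minimal sketch of zero-shot voice style transfer via disentangled embeddings.
# Assumes mel-spectrogram inputs; every size and layer choice here is illustrative.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps an utterance to a single low-dimensional speaker-style embedding."""
    def __init__(self, n_mels=80, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.proj = nn.Linear(128, style_dim)

    def forward(self, mel):               # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)              # h: (1, batch, 128), last hidden state
        return self.proj(h.squeeze(0))    # (batch, style_dim)

class ContentEncoder(nn.Module):
    """Maps an utterance to a frame-level content embedding sequence."""
    def __init__(self, n_mels=80, content_dim=32):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.proj = nn.Linear(128, content_dim)

    def forward(self, mel):               # (batch, frames, n_mels)
        out, _ = self.rnn(mel)            # (batch, frames, 128)
        return self.proj(out)             # (batch, frames, content_dim)

class Decoder(nn.Module):
    """Reconstructs a mel-spectrogram from content frames plus a style vector."""
    def __init__(self, n_mels=80, content_dim=32, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(content_dim + style_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, n_mels)

    def forward(self, content, style):
        # Broadcast the utterance-level style vector over all content frames.
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.rnn(torch.cat([content, style], dim=-1))
        return self.proj(out)

# Zero-shot transfer: the source supplies content, an unseen target supplies style.
style_enc, content_enc, decoder = StyleEncoder(), ContentEncoder(), Decoder()
source_mel = torch.randn(1, 120, 80)      # utterance from the source speaker
target_mel = torch.randn(1, 90, 80)       # utterance from an unseen target speaker
converted = decoder(content_enc(source_mel), style_enc(target_mel))
print(converted.shape)                    # (1, 120, 80): source content, target style
```

In this sketch the decoder sees only the content frames and the target style vector, so any speaker identity leaking through the content embedding would undermine the transfer; this is exactly why the abstract emphasizes keeping the two embedding spaces (ideally) independent of each other.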