Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

Voice style transfer, also called voice conversion, seeks to modify one speaker's voice so that the generated speech sounds as if it came from another (target) speaker. Previous work has made progress on voice conversion with parallel training data and speakers known in advance. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method based on disentangled representation learning. The proposed method first encodes the speaker-related style and the voice content of each input voice into separate low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and the target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On the real-world VCTK dataset, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness for voice style transfer experiments under both many-to-many and zero-shot setups.
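The encode, swap, decode flow described in the abstract can be illustrated with a toy Python sketch. This is not the authors' model: the real encoders and decoder are learned neural networks, whereas here the "style" is assumed to be the per-dimension mean over time and the "content" is assumed to be the residual around that mean. All function names are hypothetical and chosen only for illustration.

```python
from typing import List

Utterance = List[List[float]]  # a sequence of feature frames


def encode_style(utterance: Utterance) -> List[float]:
    """Toy style encoder (assumption): per-dimension mean over all frames."""
    n = len(utterance)
    dims = len(utterance[0])
    return [sum(frame[d] for frame in utterance) / n for d in range(dims)]


def encode_content(utterance: Utterance) -> Utterance:
    """Toy content encoder (assumption): each frame minus the style vector."""
    style = encode_style(utterance)
    return [[x - s for x, s in zip(frame, style)] for frame in utterance]


def decode(content: Utterance, style: List[float]) -> Utterance:
    """Toy decoder (assumption): add the style vector back to every frame."""
    return [[c + s for c, s in zip(frame, style)] for frame in content]


def transfer(source: Utterance, target: Utterance) -> Utterance:
    """Combine the source content embedding with the target style embedding."""
    return decode(encode_content(source), encode_style(target))


source = [[1.0, 2.0], [3.0, 4.0]]      # utterance from the source speaker
target = [[10.0, 10.0], [12.0, 14.0]]  # any utterance from the target speaker
converted = transfer(source, target)
print(converted)  # [[10.0, 11.0], [12.0, 13.0]]
```

In this toy setting the converted utterance has the target's style vector and the source's content residuals, which mirrors the separation the method aims for. Because the style is computed from a single target utterance, the same call works for a speaker never seen before, which is the zero-shot case.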


Related content

Representation learning


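Representation learning uses training data to learn vector representations, which overcomes the limitations of hand-crafted approaches. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of an autoencoder (such as a denoising autoencoder or a sparse autoencoder) as the representation. The more recent variational autoencoders tolerate noise and outliers better. However, inferring the latent structure of given data is almost impossible, and several approximate inference strategies exist. In addition, some unsupervised methods aim to approximate a specific similarity measure. One proposed unsupervised, similarity-preserving representation learning framework uses matrix factorization to preserve pairwise dynamic time warping (DTW) similarities. By learning DTW-preserving shapelets, the Euclidean distance in the transformed space approximates the true DTW distance of the original data. Supervised representation learning methods can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are two popular models; their objective is to maximize the distance between classes while minimizing the distance within each class.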
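The objective of triplet networks (small distance within a class, large distance between classes) is commonly written as a margin-based triplet loss. The following is a minimal Python sketch, assuming Euclidean distance and a margin of 1.0; the function names and the margin value are illustrative choices, not taken from the text above.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Margin-based triplet loss.

    The loss is zero once the negative (different class) is at least
    `margin` farther from the anchor than the positive (same class).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # same class as the anchor, distance 1.0
far_negative = [3.0, 4.0]   # other class, distance 5.0
near_negative = [0.0, 1.5]  # other class, distance 1.5

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 0.5
```

A negative that sits too close to the anchor yields a positive loss, so minimizing the loss pushes other-class embeddings away while pulling same-class embeddings together.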

Zero-Shot Learning for Text Classification

Zhuanzhi Member Service

97+ reads · May 31, 2020

[Google] Big Transfer: General Visual Representation Learning

Zhuanzhi Member Service

37+ reads · May 9, 2020

Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Zhuanzhi Member Service

17+ reads · May 6, 2020

[ACL 2020, Facebook AI] Unsupervised Cross-lingual Representation Learning at Scale

Zhuanzhi Member Service

27+ reads · April 5, 2020