The goal of voice conversion is to transform source speech into a target voice while keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content, leading to mispronunciations. As a solution, we propose soft speech units. To learn soft units, we predict a distribution over discrete speech units. By modeling uncertainty, soft units capture more content information, improving the intelligibility and naturalness of converted speech. Samples are available at https://ubisoft-laforge.github.io/speech/soft-vc/. Code is available at https://github.com/bshall/soft-vc/.
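The difference between a discrete unit and a predicted distribution over discrete units can be illustrated with a minimal sketch. It is not the released implementation: the cosine-similarity logits, the temperature value, the random toy vectors, and every name in it (`nearest_unit`, `unit_distribution`, `unit_embeddings`) are assumptions made for illustration. In a real system the soft units would come from a trained network rather than being the raw frames.

```python
import math
import random


def nearest_unit(frame, codebook):
    """Hard assignment: index of the closest codebook vector (a discrete unit)."""
    dists = [sum((x - c) ** 2 for x, c in zip(frame, centre)) for centre in codebook]
    return dists.index(min(dists))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def unit_distribution(soft_unit, unit_embeddings, temperature=0.1):
    """Softmax over similarities: a distribution over the discrete units.

    Assumption: logits are cosine similarities scaled by a temperature.
    """
    logits = [cosine(soft_unit, e) / temperature for e in unit_embeddings]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target discrete unit."""
    return -math.log(probs[target])


# Toy data (hypothetical): random vectors stand in for speech features.
random.seed(0)
DIM, NUM_UNITS, NUM_FRAMES = 8, 4, 5


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


frames = [rand_vec() for _ in range(NUM_FRAMES)]
codebook = [rand_vec() for _ in range(NUM_UNITS)]
unit_embeddings = [rand_vec() for _ in range(NUM_UNITS)]

# Discrete units act as training targets for the predicted distribution.
targets = [nearest_unit(f, codebook) for f in frames]

# Stand-in for the output of a trainable projection (untrained here).
soft_units = frames
distributions = [unit_distribution(s, unit_embeddings) for s in soft_units]
loss = sum(cross_entropy(p, t) for p, t in zip(distributions, targets)) / NUM_FRAMES
```

The hard assignment keeps only one index per frame, whereas the distribution retains how plausible each unit is. This is one way to picture how modeling uncertainty can preserve content that a single discrete label would discard.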