Voice Conversion (VC) is the task of making a spoken utterance by one speaker sound as if uttered by a different speaker, while keeping other aspects, such as content, unchanged. Current VC methods focus primarily on spectral features like timbre, while ignoring each speaker's unique speaking style, which is often reflected in prosody. In this study, we introduce a method for converting not only the timbre, but also prosodic information (i.e., rhythm and pitch changes) to those of the target speaker. The proposed approach is based on a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and easy to optimise. We consider the many-to-many setting with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that the proposed approach is significantly superior to the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.
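To make the core building block concrete, the sketch below shows one common way speech is encoded to discrete units with a pretrained self-supervised model: HuBERT features (via torchaudio) quantised with k-means. This is an illustrative assumption, not the authors' exact pipeline; the input file name, chosen layer, and cluster count are all hypothetical.

```python
# Minimal sketch: speech -> discrete units via a pretrained SSL model.
# Assumptions: torchaudio's HUBERT_BASE bundle, layer 6 features, 100 clusters.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # Returns a list of per-layer feature tensors, each (batch, frames, dim).
    features, _ = model.extract_features(waveform)
frames = features[6].squeeze(0).numpy()  # one intermediate layer, shape (T, 768)

# Quantise frames into a small discrete vocabulary of unit ids.
# In practice the k-means codebook is fit on a large corpus, not one utterance.
kmeans = KMeans(n_clusters=100, n_init=10).fit(frames)
units = kmeans.labels_  # one discrete unit id per ~20 ms frame
print(units[:20])
```

Because the resulting unit sequence is discrete and largely disentangled from speaker identity, downstream conversion models can operate on it like a token sequence, which is what makes such pipelines simple to train and optimise.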