StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

From arXiv: 16 pages, 3 figures. Published in Springer Communications in Computer and Information Science, Artificial Intelligence Research (SACAIR 2021), vol. 1342, pp. 69-84, 2020.

Voice conversion is the task of converting a spoken utterance from a source speaker so that it appears to be said by a different target speaker while retaining the linguistic content of the utterance. Recent advances have led to major improvements in the quality of voice conversion systems. However, to be useful in a wider range of contexts, voice conversion systems would need to (i) be trainable without access to parallel data, (ii) work in a zero-shot setting where both the source and target speakers are unseen during training, and (iii) run in real time or faster. Recent techniques fulfil one or two of these requirements, but not all three. This paper extends recent voice conversion models based on generative adversarial networks (GANs) to satisfy all three of these conditions. We specifically extend the recent StarGAN-VC model by conditioning it on a speaker embedding (from a potentially unseen speaker). This allows the model to be used in a zero-shot setting, and we therefore call it StarGAN-ZSVC. We compare StarGAN-ZSVC against other voice conversion techniques in a low-resource setting using a small 9-minute training set. Compared to AutoVC, another recent neural zero-shot approach, StarGAN-ZSVC gives small improvements in the zero-shot setting, showing that real-time zero-shot voice conversion is possible even for a model trained on very little data. Further work is required to see whether scaling up StarGAN-ZSVC will also improve zero-shot voice conversion quality in high-resource contexts.
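The abstract says the model is conditioned on a speaker embedding but does not say how the embedding is injected. As a minimal sketch, assuming the embedding is simply appended to every spectrogram frame before the frames reach the generator (one common conditioning scheme, not necessarily the one used in StarGAN-ZSVC), the idea looks like this. The helper `condition_on_speaker`, the toy frame values, and the embedding size are all hypothetical.

```python
from typing import List

Frame = List[float]


def condition_on_speaker(frames: List[Frame], speaker_embedding: List[float]) -> List[Frame]:
    """Append the same speaker embedding to every spectrogram frame.

    Hypothetical helper: concatenation along the feature axis is an
    assumed conditioning scheme, not a detail stated in the abstract.
    """
    if not speaker_embedding:
        raise ValueError("speaker_embedding must not be empty")
    # Copy each frame so the caller's data is left untouched.
    return [list(frame) + list(speaker_embedding) for frame in frames]


# Toy input: 3 time frames with 2 spectral features each.
source_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Toy embedding standing in for a target speaker unseen during training.
unseen_target_embedding = [1.0, -1.0, 0.5]

generator_input = condition_on_speaker(source_frames, unseen_target_embedding)
```

Because the speaker is represented by a continuous vector rather than an index into a fixed table of training speakers, the same input format can in principle be used for any speaker for whom an embedding is available, which is what makes the zero-shot setting possible.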
