Building cross-lingual voice conversion (VC) systems for multiple speakers and multiple languages has long been a challenging task. This paper describes a parallel non-autoregressive network that achieves bilingual and code-switched voice conversion for multiple speakers when only monolingual corpora are available for each language. We achieve cross-lingual VC between multi-speaker Mandarin speech and multi-speaker English speech by applying bilingual bottleneck features. To boost voice cloning performance, we use an adversarial speaker classifier with a gradient reversal layer to reduce source-speaker information in the encoder output. Furthermore, to improve speaker similarity between reference speech and converted speech, we adopt an embedding consistency loss between the synthesized speech and its natural reference speech. Experimental results show that our proposed method achieves high-quality converted speech with a mean opinion score (MOS) of around 4. The conversion system performs well in terms of speaker similarity for both in-set speaker conversion and out-of-set one-shot conversion.
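The two training devices named above, the gradient reversal layer and the embedding consistency loss, can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names, the reversal strength `lam`, and the choice of cosine distance as the consistency measure are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass.

    `lam` is a hypothetical reversal strength, not a value from the paper.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Encoder output passes through unchanged to the speaker classifier.
        return list(features)

    def backward(self, grad_output):
        # The classifier's gradient is sign-flipped before reaching the encoder,
        # so the encoder is pushed to make speaker classification harder.
        return [-self.lam * g for g in grad_output]


def embedding_consistency_loss(ref_embedding, syn_embedding):
    """Cosine distance between two speaker embeddings (assumed loss form)."""
    dot = sum(r * s for r, s in zip(ref_embedding, syn_embedding))
    norm_ref = math.sqrt(sum(r * r for r in ref_embedding))
    norm_syn = math.sqrt(sum(s * s for s in syn_embedding))
    return 1.0 - dot / (norm_ref * norm_syn)


# Toy values, chosen only to show the behaviour.
grl = GradientReversal(lam=0.5)
encoder_output = [0.2, -1.0, 3.0]
classifier_grad = [1.0, -2.0, 0.5]

passed_through = grl.forward(encoder_output)
reversed_grad = grl.backward(classifier_grad)
same_speaker_loss = embedding_consistency_loss([1.0, 0.0], [1.0, 0.0])
different_speaker_loss = embedding_consistency_loss([1.0, 0.0], [0.0, 1.0])
```

In a real system the reversal would be registered as a custom autograd operation in a deep learning framework, and the embeddings would come from a speaker encoder applied to the reference and synthesized speech; the plain-Python lists here stand in for those tensors.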