In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine their best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces GTA finetuning in VC, which significantly improves the quality and speaker similarity of the converted speech. Assem-VC outperforms the previous state-of-the-art approaches in both naturalness and speaker similarity on the VCTK dataset. As an objective analysis, we also explore the degree of speaker disentanglement in features such as phonetic posteriorgrams (PPG). Our investigation indicates that many-to-many VC results are no longer distinguishable from human speech, and that similar quality can be achieved with any-to-many models. Audio samples are available at https://mindslab-ai.github.io/assem-vc/
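To make the two-encoder-one-decoder framing concrete, the following is a minimal PyTorch sketch of such a model: one encoder for speaker-independent linguistic content (e.g., PPG frames), one encoder for the target speaker, and a single decoder that combines both to produce a mel-spectrogram. All names (`TwoEncoderOneDecoderVC`, `ppg_dim`, etc.), layer choices, and dimensions are illustrative assumptions, not the actual Assem-VC architecture.

```python
# A conceptual sketch of a two-encoder-one-decoder VC model.
# All dimensions and module choices are assumptions for illustration.
import torch
import torch.nn as nn

class TwoEncoderOneDecoderVC(nn.Module):
    def __init__(self, ppg_dim=144, spk_dim=64, hidden=256, mel_dim=80, n_speakers=108):
        super().__init__()
        # Content encoder: maps speaker-independent linguistic features
        # (e.g., PPG frames) to a hidden sequence.
        self.content_encoder = nn.GRU(ppg_dim, hidden, batch_first=True)
        # Speaker encoder: a lookup table of target-speaker embeddings
        # (any-to-many conversion: targets come from a fixed speaker set).
        self.speaker_encoder = nn.Embedding(n_speakers, spk_dim)
        # Decoder: reconstructs a mel-spectrogram from content + speaker.
        self.decoder = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.mel_proj = nn.Linear(hidden, mel_dim)

    def forward(self, ppg, speaker_id):
        content, _ = self.content_encoder(ppg)            # (B, T, hidden)
        spk = self.speaker_encoder(speaker_id)            # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, ppg.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.mel_proj(out)                         # (B, T, mel_dim)

# Usage: convert a 200-frame utterance of PPG features to target speaker 3.
model = TwoEncoderOneDecoderVC()
mel = model(torch.randn(1, 200, 144), torch.tensor([3]))
```

Under this framing, GTA finetuning corresponds to finetuning downstream components on the decoder's teacher-forced (ground-truth-aligned) outputs rather than on ground-truth spectrograms, which narrows the train-inference mismatch.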