Any-to-any voice conversion aims to convert the voice from and to any speakers, even those unseen during training, which is much more challenging than one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms. By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance, all based on the attention mechanism of the Transformer, as verified by analysis of the attention maps, and accomplished end-to-end. This approach is trained with a reconstruction loss only, without any disentanglement considerations between content and speaker information, and does not require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperforms SOTA approaches such as AdaIN-VC and AutoVC.
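To make the described fusion concrete, below is a minimal sketch of a Transformer-style cross-attention block in which source content features (e.g. Wav2Vec 2.0 hidden states) query target speaker spectral features (log mel frames), so that fine-grained target fragments are retrieved and fused into the converted utterance. This is not the authors' implementation; the module name `FragmentFuser`, the feature dimensions, and the single-layer design are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class FragmentFuser(nn.Module):
    """Hypothetical sketch of cross-attention fusion in the spirit of FragmentVC.

    Queries come from source content features (assumed Wav2Vec 2.0, dim 768);
    keys/values come from target speaker log mel frames (assumed 80 bins).
    """

    def __init__(self, content_dim=768, mel_dim=80, model_dim=512, n_heads=8):
        super().__init__()
        self.content_proj = nn.Linear(content_dim, model_dim)  # source (content) branch
        self.target_proj = nn.Linear(mel_dim, model_dim)       # target (speaker) branch
        self.cross_attn = nn.MultiheadAttention(model_dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(model_dim, mel_dim)          # predict converted mel frames

    def forward(self, content_feats, target_mels):
        # content_feats: (batch, T_src, content_dim) from the source utterance
        # target_mels:   (batch, T_tgt, mel_dim) from the target speaker utterance(s)
        q = self.content_proj(content_feats)
        kv = self.target_proj(target_mels)
        # Queries = content, keys/values = target frames; the returned weights
        # correspond to the kind of attention maps analyzed in the paper.
        fused, attn_weights = self.cross_attn(q, kv, kv)
        return self.out_proj(fused), attn_weights

# Usage sketch with a reconstruction-style objective and dummy tensors:
# during training, target frames can come from the same speaker as the source,
# so no parallel data is needed.
model = FragmentFuser()
content = torch.randn(2, 120, 768)              # stand-in Wav2Vec 2.0 features
target = torch.randn(2, 200, 80)                # stand-in target log mel frames
pred_mel, attn = model(content, target)
loss = nn.functional.l1_loss(pred_mel, torch.randn(2, 120, 80))  # dummy reconstruction target
```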