We propose a novel architecture and improved training objectives for non-parallel voice conversion. Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high frequency-resolution magnitude spectrogram, converting its style (i.e. speaker identity) while preserving the speech content. Throughout the entire conversion process, the model does not resort to compressed intermediate representations of any sort (e.g. mel spectrogram, low-resolution spectrogram, decomposed network features). We propose an efficient axial residual block architecture to support this expensive procedure, along with various modifications to the CycleGAN losses to stabilize training. We demonstrate via experiments that our proposed model outperforms Scyclone and performs comparably to or better than CycleGAN-VC2, even without employing a neural vocoder.
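To make the "axial residual block" concrete, the snippet below is a minimal PyTorch sketch of one plausible form of such a block: a residual unit that replaces a full 2-D convolution with separate 1-D convolutions along the frequency and time axes of a (batch, channels, freq, time) spectrogram. The class name, channel count, kernel size, and activation are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical axial residual block: 1-D convolutions applied per axis,
# wrapped in a residual connection. Not the authors' exact implementation.
import torch
import torch.nn as nn


class AxialResidualBlock(nn.Module):
    def __init__(self, channels: int = 64, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        # Convolution along the frequency axis only (kernel spans freq, width 1 in time).
        self.freq_conv = nn.Conv2d(channels, channels,
                                   kernel_size=(kernel_size, 1),
                                   padding=(pad, 0))
        # Convolution along the time axis only (kernel spans time, width 1 in freq).
        self.time_conv = nn.Conv2d(channels, channels,
                                   kernel_size=(1, kernel_size),
                                   padding=(0, pad))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames)
        h = self.act(self.freq_conv(x))
        h = self.time_conv(h)
        # Residual connection keeps the spectrogram shape unchanged.
        return x + h


if __name__ == "__main__":
    # Example: a high frequency-resolution spectrogram with 513 bins and 128 frames.
    spec = torch.randn(1, 64, 513, 128)
    out = AxialResidualBlock()(spec)
    print(out.shape)  # torch.Size([1, 64, 513, 128])
```

Splitting the kernel per axis keeps the receptive field growing in both dimensions while costing roughly 2·k instead of k² weights per channel pair, which is what makes operating on uncompressed, high-resolution spectrograms tractable.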