This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, it uses an information-theoretic regularization for the model training to ensure that the information in the attribute class label will not be lost in the conversion process. With regular CVAEs, the encoder and decoder are free to ignore the attribute class label input. This can be problematic since in such a situation, the attribute class label will have little effect on controlling the voice characteristics of input speech at test time. Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier. Third, it avoids producing buzzy-sounding speech at test time by simply transplanting the spectral details of the input speech into its converted version. Subjective evaluation experiments revealed that this simple method worked reasonably well in a non-parallel many-to-many speaker identity conversion task.
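The training objective implied by the abstract combines the usual conditional-VAE terms (reconstruction and KL) with an auxiliary-classifier cross-entropy that penalizes the decoder when its output no longer reveals the attribute class. Below is a minimal NumPy sketch of that combined loss; the encoder, decoder, and classifier are stand-in functions (not the paper's convolutional networks), and the weight `lam` on the regularizer is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of acoustic feature sequences (B, D, T)
# and one-hot attribute (speaker) labels (B, K).
B, D, T, K = 4, 36, 128, 4
x = rng.standard_normal((B, D, T))
c = np.eye(K)[rng.integers(0, K, B)]  # one-hot class labels

def encoder(x, c):
    # Stand-in for the fully convolutional encoder: mean/log-variance of z.
    mu = 0.1 * x.mean(axis=2, keepdims=True).repeat(T, axis=2)
    logvar = np.zeros_like(mu)
    return mu, logvar

def decoder(z, c):
    # Stand-in for the conditional decoder.
    return z

def classifier(y):
    # Stand-in auxiliary classifier: class probabilities from decoder output.
    logits = y.mean(axis=(1, 2))[:, None].repeat(K, axis=1)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

mu, logvar = encoder(x, c)
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterization

y = decoder(z, c)

recon = np.mean((x - y) ** 2)                                   # reconstruction
kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))        # KL(q(z|x,c) || N(0, I))
aux = -np.mean(np.sum(c * np.log(classifier(y) + 1e-12), axis=1))  # classifier cross-entropy

lam = 1.0  # regularization weight (assumed value, not from the paper)
loss = recon + kl + lam * aux
```

Minimizing `aux` alongside the VAE terms pushes the decoder to produce outputs whose attribute class the classifier can recover, which is the mechanism the abstract describes for preventing the encoder and decoder from ignoring the class label.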