This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It directly generates an arbitrary singing waveform from a phonetic posteriorgram (PPG) representing content, F0 representing pitch, and a speaker embedding representing timbre. The proposed system is composed of three modules: the generator $G$, the audio generation discriminator $D_{A}$, and the feature disentanglement discriminator $D_F$. The generator $G$ encodes the features in parallel and inversely transforms them into the target waveform. To make timbre conversion more stable and controllable, the speaker embedding is further decomposed into a weighted sum of a group of trainable vectors representing different timbre clusters. Further, to achieve more robust and accurate singing conversion, the disentanglement discriminator $D_F$ is proposed to remove pitch- and timbre-related information that remains in the encoded PPG. Finally, two-stage training is conducted to keep the adversarial training process stable and effective. Subjective evaluation results demonstrate the effectiveness of the proposed methods: the proposed system outperforms the conventional cascade approach and the WaveNet-based end-to-end approach in terms of both singing quality and singer similarity. Further objective analysis reveals that the model trained with the proposed two-stage training strategy produces smoother and sharper formants, which leads to higher audio quality.
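The decomposition of the speaker embedding into a weighted sum of trainable timbre-cluster vectors can be illustrated with a small sketch. The abstract does not say how the weights are computed, so the softmax over dot-product scores used here is an assumption, and the names `cluster_keys`, `cluster_vectors`, and `decompose_speaker_embedding` are hypothetical. In a real model both matrices would be trained jointly with the generator; here they are random placeholders.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def decompose_speaker_embedding(speaker_emb, cluster_keys, cluster_vectors):
    """Re-express a speaker embedding as a weighted sum of cluster vectors.

    speaker_emb:     (D_in,)      input speaker embedding
    cluster_keys:    (K, D_in)    hypothetical keys used to score each cluster
    cluster_vectors: (K, D_out)   trainable vectors, one per timbre cluster

    The softmax scoring is an assumption; the source only states that the
    embedding becomes a weighted sum of trainable vectors.
    """
    scores = cluster_keys @ speaker_emb   # (K,) similarity to each cluster
    weights = softmax(scores)             # (K,) non-negative, sums to 1
    timbre = weights @ cluster_vectors    # (D_out,) weighted sum of clusters
    return timbre, weights


# Random placeholders standing in for trained parameters.
rng = np.random.default_rng(0)
K, D_in, D_out = 4, 8, 6
cluster_keys = rng.normal(size=(K, D_in))
cluster_vectors = rng.normal(size=(K, D_out))
speaker_emb = rng.normal(size=D_in)

timbre, weights = decompose_speaker_embedding(
    speaker_emb, cluster_keys, cluster_vectors
)
```

Because the output is built only from the shared cluster vectors, every converted timbre is a weighted combination of the learned cluster vectors, which is one plausible reading of why the decomposition makes conversion more stable and controllable.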