Previous research has shown that established techniques for spoken voice conversion (VC) do not perform as well when applied to singing voice conversion (SVC). We propose an alternative component for a loss function that is otherwise well established in VC tasks, and show that it improves our model's SVC performance. We first trained a singer identity embedding (SIE) network on mel-spectrograms of singer recordings, using contrastive learning to produce singer-specific variance encodings. We then trained a well-known autoencoder framework (AutoVC) conditioned on these SIEs and measured how SVC performance differs across latent regressor loss components. We found that computing this loss with respect to SIEs leads to better performance than computing it with respect to bottleneck embeddings: the converted audio is more natural and more specific to the target singers. Including this loss component has the advantage of explicitly forcing the network to reconstruct with timbral similarity, and it also negates the effect of poor disentanglement in AutoVC's bottleneck embeddings. We demonstrate a notable divergence between computational and human evaluations of singer-converted audio clips, which highlights the necessity of both. We also propose a pitch-matching mechanism between source and target singers to ensure that these evaluations are not influenced by differences in pitch register.
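The abstract does not specify how the pitch-matching mechanism works, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes per-frame F0 contours in Hz with unvoiced frames marked as 0, and it matches registers by transposing the source by the nearest whole number of semitones between the two singers' median voiced F0 values. The function names `median_voiced_f0`, `semitone_shift`, and `transpose` are hypothetical.

```python
import math
from statistics import median


def median_voiced_f0(f0_hz):
    """Median F0 over voiced frames (assumption: unvoiced frames are 0)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    return median(voiced)


def semitone_shift(source_f0_hz, target_f0_hz):
    """Whole-semitone shift that moves the source register onto the target's.

    Assumption: registers are compared through median voiced F0;
    the paper may use a different statistic or step size.
    """
    ratio = median_voiced_f0(target_f0_hz) / median_voiced_f0(source_f0_hz)
    return round(12 * math.log2(ratio))


def transpose(f0_hz, semitones):
    """Shift a contour by a number of semitones, leaving unvoiced frames at 0."""
    factor = 2 ** (semitones / 12)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


# Toy contours: the target sings one octave above the source.
source = [0.0, 220.0, 220.0, 233.08]
target = [440.0, 0.0, 440.0, 466.16]
shift = semitone_shift(source, target)
matched_source = transpose(source, shift)
```

Rounding to whole semitones keeps the transposed melody on the equal-tempered scale; a register-only correction could instead round to multiples of 12 so that the key is preserved.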