In this paper we propose modifications to the neural network framework AutoVC for the task of singing technique conversion. These include using a pretrained singing technique encoder that extracts technique information, on which a decoder is conditioned during training. During conversion, the source singer's technique information is swapped for the target's, so that the input spectrogram is reconstructed with the target's technique. We document the beneficial effects of omitting the latent loss, the importance of sequential training, and our process for fine-tuning the bottleneck. We also conducted a listening study in which participants rated the specificity and the naturalness of technique-converted voices. From this study we draw conclusions about how effective the technique conversions are and how different conditions affect them, and we assess the model's ability to reconstruct its input data.
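The swap described in the abstract can be illustrated with a minimal data-flow sketch. It is not the model from the paper: every function here (`technique_embedding`, `content_code`, `decoder_input`, `convert`) is a hypothetical stand-in with no learned weights, and the `stride` parameter and the time-averaging used in place of the pretrained technique encoder are assumptions made only so the example runs. It shows one thing: the content code comes from the source, while the conditioning vector comes from either the source (reconstruction) or the target (conversion).

```python
from typing import List

Frame = List[float]
Spectrogram = List[Frame]  # time-major: one list of mel-bin values per frame


def technique_embedding(spec: Spectrogram) -> List[float]:
    """Hypothetical stand-in for the pretrained technique encoder:
    averages each bin over time to give one fixed-length vector."""
    n_frames = len(spec)
    n_bins = len(spec[0])
    return [sum(frame[b] for frame in spec) / n_frames for b in range(n_bins)]


def content_code(spec: Spectrogram, stride: int = 2) -> Spectrogram:
    """Hypothetical stand-in for the bottleneck: keeps every `stride`-th frame."""
    return [list(frame) for frame in spec[::stride]]


def decoder_input(code: Spectrogram, technique: List[float],
                  n_frames: int, stride: int = 2) -> Spectrogram:
    """Assembles the conditioned decoder input: the content code is upsampled
    by repetition and the technique vector is appended to every frame.
    A trained decoder would map these frames back to a spectrogram."""
    upsampled = [code[min(t // stride, len(code) - 1)] for t in range(n_frames)]
    return [frame + technique for frame in upsampled]


def convert(source: Spectrogram, target: Spectrogram, stride: int = 2) -> Spectrogram:
    """Content from the source, technique from the target.
    Passing the same spectrogram twice gives the reconstruction setting."""
    code = content_code(source, stride)
    technique = technique_embedding(target)
    return decoder_input(code, technique, len(source), stride)


if __name__ == "__main__":
    source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    target = [[0.0, 10.0], [2.0, 20.0]]
    print(convert(source, source))  # reconstruction: source technique
    print(convert(source, target))  # conversion: target technique
```

In this sketch the only difference between reconstruction and conversion is which spectrogram supplies the technique vector, which is the conditioning swap the abstract describes.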