Many existing works on singing voice conversion (SVC) require clean recordings of the target singer's voice for training. However, such recordings are often difficult to collect in advance, and singing voices are often distorted by reverb and accompaniment music. In this work, we propose robust one-shot SVC (ROSVC), which performs any-to-any SVC robustly even on such distorted singing voices using less than 10 s of reference voice. To this end, we propose a two-stage training method called Robustify. In the first stage, a novel one-shot SVC model based on a generative adversarial network is trained on clean data to ensure high-quality conversion. In the second stage, enhancement modules are introduced into the encoders of the model to improve robustness against distortions in the feature space. Experimental results show that the proposed method outperforms one-shot SVC baselines for both seen and unseen singers and greatly improves robustness against the distortions.
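The two-stage idea can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the scalar `encoder`, the affine `distort` function standing in for reverb and accompaniment, the affine enhancement module `g(h) = a*h + b`, and the mean-squared feature-matching loss. The sketch only shows the general pattern of keeping a stage-1 encoder fixed while an added module learns to map features of distorted input back to the features of the clean input.

```python
def encoder(x, w=2.0):
    """Hypothetical stand-in for an encoder already trained on clean data (stage 1).
    It is never updated below, i.e. it is treated as frozen in stage 2."""
    return w * x


def distort(x, gain=0.5, offset=1.0):
    """Toy distortion (assumption): a crude placeholder for reverb/accompaniment."""
    return gain * x + offset


def train_enhancer(clean_samples, steps=2000, lr=0.1):
    """Stage 2 (sketch): fit g(h) = a*h + b so that g(encoder(distorted))
    matches encoder(clean) in feature space. Only a and b are updated."""
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(clean_samples)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x in clean_samples:
            target = encoder(x)          # feature of the clean input
            h = encoder(distort(x))      # feature of the distorted input
            err = (a * h + b) - target   # feature-space mismatch
            grad_a += 2.0 * err * h / n  # d(MSE)/da
            grad_b += 2.0 * err / n      # d(MSE)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


if __name__ == "__main__":
    samples = [i / 10 for i in range(-10, 11)]
    a, b = train_enhancer(samples)
    x = 0.3
    print("clean feature:   ", encoder(x))
    print("enhanced feature:", a * encoder(distort(x)) + b)
```

In a real system the encoder, the distortions, and the enhancement modules would be neural networks and audio transforms; the sketch keeps them scalar so that the frozen-encoder, feature-matching structure stays visible.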