Conventional singing voice conversion (SVC) methods often struggle with high-resolution audio because of the high dimensionality of the data. In this paper, we propose a hierarchical representation learning method that learns disentangled representations at multiple resolutions independently. Using the learned disentangled representations, the proposed method performs SVC progressively from low to high resolution. Experimental results show that the proposed method outperforms single-resolution baselines in terms of mean opinion score (MOS), similarity score, and pitch accuracy.