Singing voice separation (SVS) is the task of separating the singing voice from a mixture of vocals and instrumental accompaniment. Previous SVS studies have mainly relied on spectrogram masking, which requires predicting high-dimensional binary masks. They have also focused on extracting a vocal stem that retains the wet sound, that is, the voice with its reverberation effect, which may hinder the reusability of the isolated singing voice. This paper addresses both issues by predicting the mel-spectrogram of the dry singing voice from the mixed audio, using it as the input features of a neural vocoder, and synthesizing the singing voice waveform with that vocoder. We experimented with two separation methods: one predicts binary masks in the mel-spectrogram domain, and the other predicts the mel-spectrogram directly. Furthermore, we add a singing voice detector to identify the singing voice segments over time more explicitly. We measured model performance in terms of audio quality, dereverberation, separation, and overall quality. The results show that our proposed model outperforms state-of-the-art singing voice separation models in both objective and subjective evaluations, except for audio quality.
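The two separation methods, together with the singing voice detector, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the network outputs (`mask_logits`, `direct_mel`, `frame_probs`) are random placeholders standing in for a trained model, the 80-band mel resolution and the 0.5 thresholds are arbitrary choices, and using the detector to gate output frames is an assumption, since the abstract does not say how the detector is integrated.

```python
import numpy as np

def apply_binary_mask(mix_mel, mask_logits, threshold=0.5):
    """Mask-based variant: keep mixture mel bins whose mask probability exceeds the threshold."""
    probs = 1.0 / (1.0 + np.exp(-mask_logits))        # sigmoid over hypothetical network logits
    mask = (probs > threshold).astype(mix_mel.dtype)  # binary mask, same shape as the mel-spectrogram
    return mix_mel * mask

def gate_with_voice_activity(vocal_mel, frame_probs, threshold=0.5, floor=0.0):
    """Assumed use of the detector: set frames judged non-vocal to a floor value."""
    active = frame_probs > threshold                  # shape (n_frames,)
    return np.where(active[np.newaxis, :], vocal_mel, floor)

rng = np.random.default_rng(0)
n_mels, n_frames = 80, 100                            # illustrative sizes, not from the paper

mix_mel = rng.random((n_mels, n_frames))              # placeholder mixture mel-spectrogram
mask_logits = rng.normal(size=(n_mels, n_frames))     # placeholder output of a mask predictor
direct_mel = rng.random((n_mels, n_frames))           # placeholder output of a direct predictor
frame_probs = rng.random(n_frames)                    # placeholder per-frame vocal probabilities

# Method 1: binary masking in the mel-spectrogram domain.
masked_mel = apply_binary_mask(mix_mel, mask_logits)
# Method 2: the predicted mel-spectrogram is used directly.
gated_masked = gate_with_voice_activity(masked_mel, frame_probs)
gated_direct = gate_with_voice_activity(direct_mel, frame_probs)
```

In either variant, the resulting mel-spectrogram would then be passed to a neural vocoder to synthesize the waveform; that step is omitted here because the abstract does not name the vocoder.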