Music source separation can be interpreted as the estimation of the constituent music sources of which a music clip is composed. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from the audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We propose Y-Net, an audio-visual convolutional neural network that achieves state-of-the-art singing voice separation results on the Acappella dataset, and we compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. Singing voice separation is particularly challenging when the audio mixture comprises other accompanying voices and background sounds along with the target voice of interest. We demonstrate that our model outperforms the baseline models on the singing voice separation task in such challenging scenarios. The code, the pre-trained models and the dataset will be publicly available at https://ipcv.github.io/Acappella/
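To make the audio-visual setting concrete, the following is a minimal, illustrative sketch of a spectrogram-masking separator conditioned on visual features. It is not the paper's Y-Net: the module names, feature sizes, and the additive fusion strategy are assumptions made purely for illustration.

```python
# Illustrative sketch only: a toy audio-visual separation model in the spirit of
# combining an audio encoder-decoder with a visual stream. All sizes and the
# fusion strategy are assumptions, not the Y-Net architecture from the paper.
import torch
import torch.nn as nn


class ToyAudioVisualSeparator(nn.Module):
    """Predicts a soft time-frequency mask for the target singing voice,
    conditioning an audio encoder-decoder on a visual embedding of the singer."""

    def __init__(self, visual_dim=512, hidden=64):
        super().__init__()
        # Audio encoder over magnitude spectrograms of shape (B, 1, F, T).
        self.enc = nn.Sequential(
            nn.Conv2d(1, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Project visual features (e.g. from a face/lip network) to the
        # audio feature dimension so they can be fused by addition.
        self.vis_proj = nn.Linear(visual_dim, hidden)
        # Audio decoder producing a mask at the original spectrogram size.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mix_spec, visual_feat):
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the mixture.
        # visual_feat: (B, visual_dim) pooled visual embedding of the singer.
        a = self.enc(mix_spec)
        v = self.vis_proj(visual_feat)[:, :, None, None]  # broadcast over (F, T)
        mask = self.dec(a + v)
        return mask * mix_spec  # estimated target-voice spectrogram


if __name__ == "__main__":
    model = ToyAudioVisualSeparator()
    mix = torch.randn(2, 1, 256, 128).abs()  # fake mixture spectrograms
    vis = torch.randn(2, 512)                # fake visual embeddings
    print(model(mix, vis).shape)             # torch.Size([2, 1, 256, 128])
```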