The task of isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, jointly learning from the audio and visual modalities. To this end, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose a graph-based audio-visual convolutional network that achieves state-of-the-art singing voice separation results on our dataset, and we compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. We evaluate the models in the following challenging setups: i) presence of overlapping voices in the audio mixtures, ii) the target voice set to lower volume levels in the mix, and iii) the combination of i) and ii), which is the most challenging of the three. We demonstrate that our model outperforms the baseline models in the singing voice separation task in this most challenging evaluation setup. The code, the pre-trained models, and the dataset are publicly available at https://ipcv.github.io/Acappella/
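To make the evaluation setups concrete, the sketch below shows one way such test mixtures can be constructed from mono waveforms: summing a target voice with overlapping interfering voices (setup i), attenuating the target before mixing (setup ii), or both (setup iii). This is a minimal illustration, not the authors' pipeline; the function name, the gain parameter, and the peak normalization are assumptions for this example.

```python
import numpy as np

def make_mixture(target, interferers, target_gain=1.0):
    """Build a single-channel evaluation mixture.

    target      -- 1-D float array, the target singing voice
    interferers -- list of 1-D float arrays (overlapping voices)
    target_gain -- < 1.0 reproduces the harder setup where the
                   target voice sits at a lower volume in the mix
    """
    mix = target_gain * target
    for voice in interferers:
        mix = mix + voice
    # Illustrative peak normalization to keep the mix in [-1, 1]
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak
    return mix

# Setup iii): overlapping interferer plus an attenuated target
target = np.full(16000, 0.5)       # placeholder "voice" signals
interferer = np.full(16000, 0.5)
mix = make_mixture(target, [interferer], target_gain=0.5)
```

In a real evaluation, the separated output would then be compared against the clean `target` with standard source separation metrics.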