The advent of deep learning has led to the prevalence of deep neural network architectures for monaural music source separation, with end-to-end approaches that operate directly on the waveform receiving increasing research attention. Among these approaches, transforming the input mixture into a learned latent space and multiplicatively applying a soft mask to the latent mixture achieves the best performance, but is prone to introducing artifacts into the source estimate. To alleviate this problem, in this paper we propose a hybrid time-domain approach, termed HTMD-Net, which combines a lightweight masking component with a denoising module based on skip connections in order to refine the source estimate produced by the masking procedure. Evaluation of our approach on the task of monaural singing voice separation on the musdb18 dataset indicates that, when trained under the same conditions, our proposed method achieves performance competitive with purely mask-based methods, particularly regarding its behavior during silent segments, while being more computationally efficient.
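To make the overall idea concrete, the following is a minimal PyTorch sketch of a hybrid time-domain pipeline of the kind described above: a learned encoder/decoder with multiplicative soft masking in the latent space, followed by a small skip-connection denoiser that refines the masked estimate. All class names, layer sizes, kernel widths, and strides here are illustrative assumptions, not the published HTMD-Net configuration.

```python
# Hedged sketch of a "masking + denoising refinement" time-domain separator.
# Hyperparameters and module structure are assumptions for illustration only.
import torch
import torch.nn as nn


class MaskingBranch(nn.Module):
    """Encode the waveform, apply a multiplicative soft mask in the latent space, decode."""

    def __init__(self, latent_channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent_channels, kernel_size, stride=stride, padding=kernel_size // 2)
        self.mask_net = nn.Sequential(
            nn.Conv1d(latent_channels, latent_channels, 1),
            nn.ReLU(),
            nn.Conv1d(latent_channels, latent_channels, 1),
            nn.Sigmoid(),  # soft mask with values in [0, 1]
        )
        self.decoder = nn.ConvTranspose1d(latent_channels, 1, kernel_size, stride=stride, padding=kernel_size // 2)

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(mixture)            # (batch, channels, frames)
        masked = latent * self.mask_net(latent)   # multiplicative soft masking
        estimate = self.decoder(masked)
        return estimate[..., : mixture.shape[-1]]  # assumes input length divisible by stride


class DenoisingModule(nn.Module):
    """Small convolutional denoiser with skip connections, refining the masked estimate."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv1d(1, channels, 15, stride=2, padding=7), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv1d(channels, channels, 15, stride=2, padding=7), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose1d(channels, channels, 16, stride=2, padding=7), nn.ReLU())
        self.up1 = nn.ConvTranspose1d(channels, 1, 16, stride=2, padding=7)

    def forward(self, estimate: torch.Tensor) -> torch.Tensor:
        d1 = self.down1(estimate)
        d2 = self.down2(d1)
        u2 = self.up2(d2) + d1        # skip connection between encoder and decoder levels
        refined = self.up1(u2)
        return refined + estimate     # residual connection around the whole module


class HybridTimeDomainNet(nn.Module):
    """Masking branch followed by a denoising refinement stage."""

    def __init__(self):
        super().__init__()
        self.masking = MaskingBranch()
        self.denoiser = DenoisingModule()

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        rough = self.masking(mixture)
        return self.denoiser(rough)


if __name__ == "__main__":
    model = HybridTimeDomainNet()
    mixture = torch.randn(2, 1, 16384)  # (batch, channel, samples), mono audio
    estimate = model(mixture)
    print(estimate.shape)               # torch.Size([2, 1, 16384])
```

In this sketch, the denoiser operates directly on the time-domain output of the masking branch and adds a residual correction, which is one simple way to realize the "refine the masked estimate" idea; the actual HTMD-Net denoising module may differ in structure and connectivity.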