Music source separation in the time-frequency domain is commonly achieved by applying a soft or binary mask to the magnitude component of (complex) spectrograms. The phase component is usually not estimated, but instead copied from the mixture and applied to the magnitudes of the estimated isolated sources. While this method has several practical advantages, it imposes an upper bound on the performance of the system, and the estimated isolated sources inherently exhibit audible "phase artifacts". In this paper, we address these shortcomings by directly estimating masks in the complex domain, extending recent work from the speech enhancement literature. The method is particularly well suited to multi-instrument musical source separation, since residual phase artifacts are more pronounced for spectrally overlapping instrument sources, a common scenario in music. We show that complex masks result in better separation than masks that operate solely on the magnitude component.
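The contrast between magnitude-only and complex masking can be illustrated with a small oracle experiment. The sketch below is not the estimation method of the paper, which predicts masks from the mixture. It assumes the true source spectrogram is known and uses random complex values in place of real STFT bins. The function names `magnitude_mask_estimate` and `complex_mask_estimate`, as well as the `eps` stabilizer, are hypothetical and introduced only for illustration.

```python
import numpy as np


def magnitude_mask_estimate(source, mixture, eps=1e-12):
    """Oracle magnitude mask: scale the mixture magnitude, keep the mixture phase."""
    mask = np.abs(source) / (np.abs(mixture) + eps)  # real-valued, non-negative
    return mask * mixture


def complex_mask_estimate(source, mixture, eps=1e-12):
    """Oracle complex mask: a complex ratio that corrects magnitude and phase."""
    mask = source / (mixture + eps)  # complex-valued
    return mask * mixture


# Toy "spectrograms": random complex bins standing in for two overlapping sources
# (an assumption for illustration, not real audio).
rng = np.random.default_rng(0)
shape = (4, 8)  # (frequency bins, time frames)
s1 = rng.normal(size=shape) + 1j * rng.normal(size=shape)
s2 = rng.normal(size=shape) + 1j * rng.normal(size=shape)
x = s1 + s2  # mixing is linear in the complex STFT domain

mag_est = magnitude_mask_estimate(s1, x)
cplx_est = complex_mask_estimate(s1, x)

# Mean squared error against the true source in the complex domain.
mag_err = np.mean(np.abs(mag_est - s1) ** 2)
cplx_err = np.mean(np.abs(cplx_est - s1) ** 2)

print(f"magnitude mask + mixture phase, MSE: {mag_err:.3e}")
print(f"complex mask, MSE: {cplx_err:.3e}")
```

Even with a perfect magnitude estimate, the first estimator keeps the mixture phase, so its error stays well above zero wherever the sources overlap. The oracle complex mask recovers the source up to numerical precision, which is the upper bound that magnitude-only masking cannot reach.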