State-of-the-art audio source separation models rely on supervised, data-driven approaches, which can be expensive in terms of labeling resources. Approaches that train these models without any direct supervision, on the other hand, are typically highly demanding in memory and time, and remain impractical to use at inference time. We aim to tackle these limitations by proposing a simple yet effective unsupervised separation algorithm that operates directly on a latent representation of time-domain signals. Our algorithm relies on deep Bayesian priors, in the form of pre-trained autoregressive networks, to model the probability distribution of each source. We leverage the low cardinality of the discrete latent space, trained with a novel loss term that imposes a precise arithmetic structure on it, to perform exact Bayesian inference without relying on an approximation strategy. We validate our approach on the Slakh dataset (arXiv:1909.08494), demonstrating results in line with state-of-the-art supervised approaches while requiring fewer resources than other unsupervised methods.
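To illustrate why a low-cardinality discrete latent space makes exact Bayesian inference tractable, here is a minimal toy sketch. All names, distributions, and the modular-sum mixture model are illustrative assumptions, not the paper's actual formulation: with a small vocabulary, the posterior over source-token pairs can be enumerated exhaustively, so no variational or sampling-based approximation is needed.

```python
import itertools
import math

# Toy latent vocabulary; low cardinality makes exhaustive search feasible.
K = 4

# Hypothetical first-order autoregressive priors for each source:
# prior[prev_token][next_token] = p(z_t | z_{t-1}).
prior_a = [[0.7, 0.1, 0.1, 0.1],
           [0.1, 0.7, 0.1, 0.1],
           [0.1, 0.1, 0.7, 0.1],
           [0.1, 0.1, 0.1, 0.7]]
prior_b = prior_a  # assume both sources share the same prior

def likelihood(m, za, zb):
    # Toy mixture likelihood: the mixture token is the sum of the source
    # tokens modulo K (a stand-in for the arithmetic structure the loss
    # term imposes on the latent space).
    return 1.0 if m == (za + zb) % K else 0.0

def separate(mixture, prev_a=0, prev_b=0):
    """Exact MAP over all (za, zb) pairs for one time step."""
    best, best_score = None, -math.inf
    for za, zb in itertools.product(range(K), repeat=2):
        score = (likelihood(mixture, za, zb)
                 * prior_a[prev_a][za]
                 * prior_b[prev_b][zb])
        if score > best_score:
            best, best_score = (za, zb), score
    return best

print(separate(1))  # enumerates all K*K pairs, returns the exact argmax
```

The key point is the cost: enumeration is O(K^2) per step, which is cheap when K is small, whereas doing the same in a continuous or high-cardinality latent space would force an approximation.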