Self-Supervised Learning from Automatically Separated Sound Scenes

Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
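
The idea of associating a mixture with its separated outputs can be illustrated with a minimal sketch. It assumes an InfoNCE-style similarity-maximization loss in which the embedding of a mixture is pulled toward the embedding of a channel separated from the same clip and pushed away from channels of other clips in the batch. The function names, the temperature value, and the random stand-in embeddings are illustrative assumptions and not the authors' implementation; the abstract does not specify the exact loss, the encoder, or the separation model, and the coincidence prediction objective is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce_loss(mix_emb, sep_emb, temperature=0.1):
    """Contrastive loss between two views (hypothetical helper).

    mix_emb: (batch, dim) embeddings of the input mixtures.
    sep_emb: (batch, dim) embeddings of one separated channel per mixture.
    Row i of both arrays is assumed to come from the same clip (positive pair);
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(mix_emb)
    b = l2_normalize(sep_emb)
    logits = a @ b.T / temperature                       # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy on the diagonal


# Stand-in data: in practice these would be encoder outputs, not random vectors.
rng = np.random.default_rng(0)
mix = rng.normal(size=(8, 16))
sep_matched = mix + 0.05 * rng.normal(size=(8, 16))  # views that agree with their mixture
sep_shuffled = sep_matched[::-1]                      # deliberately wrong pairing

loss_matched = info_nce_loss(mix, sep_matched)
loss_shuffled = info_nce_loss(mix, sep_shuffled)
```

With correctly paired views the loss is lower than with mismatched pairs, which is the signal a contrastive learner would exploit to link a mixture to its constituent sources.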
