Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18 while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.
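To make the query-based conditioning concrete, below is a minimal PyTorch sketch in which a latent query embedding (in the paper, produced by the latent embedding processor from a query) modulates a mask-predicting separator via FiLM-style feature-wise affine conditioning. All module names, tensor shapes, and the FiLM mechanism here are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a query-conditioned separator (not the paper's model).
# A query embedding q scales and shifts hidden features (FiLM-style), so one
# network can extract different target sources depending on the query.
class QueryConditionedSeparator(nn.Module):
    def __init__(self, n_bins=513, emb_dim=128, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU())
        self.film = nn.Linear(emb_dim, 2 * hidden)   # per-query scale and shift
        self.decoder = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mix_spec, query_emb):
        # mix_spec: (batch, frames, n_bins) magnitude spectrogram of the mixture
        # query_emb: (batch, emb_dim) latent embedding of the target source type
        h = self.encoder(mix_spec)
        scale, shift = self.film(query_emb).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)  # broadcast over frames
        mask = self.decoder(h)              # soft mask in [0, 1] per T-F bin
        return mask * mix_spec              # estimated target spectrogram

# Usage: the same model separates whichever source the query embedding describes.
model = QueryConditionedSeparator()
mix = torch.rand(4, 100, 513)    # dummy batch of mixture spectrograms
query = torch.randn(4, 128)      # dummy query embeddings
target_est = model(mix, query)   # (4, 100, 513)
```

Because the separator is conditioned on an embedding rather than a fixed output head per source class, unseen source types can in principle be targeted at inference time by supplying a new query embedding, which is the mechanism behind the zero-shot claim.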