CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
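
The query-conditioning idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the query vector conditions the separation model, so the sketch assumes one plausible scheme in which the query vector is projected to weights that combine several candidate masks. The names `fake_query_embedding`, `candidate_mask_logits`, `W_proj`, and all sizes are hypothetical, and random vectors stand in for real CLIP embeddings and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 512              # assumed size of the shared image/text embedding
NUM_MASKS = 8                # hypothetical number of candidate masks
FREQ_BINS, FRAMES = 64, 32   # toy magnitude-spectrogram size


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fake_query_embedding():
    """Stand-in for a CLIP image or text embedding (unit-norm vector)."""
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)


# Hypothetical "learned" parameters, random here for illustration only.
MASK_FILTERS = rng.standard_normal((NUM_MASKS, FREQ_BINS, 1))
W_proj = rng.standard_normal((NUM_MASKS, EMBED_DIM)) / np.sqrt(EMBED_DIM)


def candidate_mask_logits(mixture_spec):
    """Stand-in for an audio network: one logit map per candidate mask."""
    return MASK_FILTERS * mixture_spec[None, :, :]   # (NUM_MASKS, F, T)


def separate(mixture_spec, query_vec):
    """Condition the separation on the query vector.

    The query is projected to one weight per candidate mask, the weighted
    sum of mask logits is squashed to [0, 1], and the resulting mask is
    applied to the mixture spectrogram.
    """
    weights = W_proj @ query_vec                      # (NUM_MASKS,)
    logits = candidate_mask_logits(mixture_spec)      # (NUM_MASKS, F, T)
    mask = sigmoid(np.tensordot(weights, logits, axes=1))  # (F, T)
    return mask * mixture_spec, mask


# Toy magnitude spectrogram of an audio mixture.
mixture = np.abs(rng.standard_normal((FREQ_BINS, FRAMES)))

# Training would query with an image embedding from a video frame; testing
# can swap in a text embedding because both live in the same space.
image_query = fake_query_embedding()
text_query = fake_query_embedding()

sep_from_image, mask_image = separate(mixture, image_query)
sep_from_text, mask_text = separate(mixture, text_query)
```

The point of the sketch is that `separate` never needs to know whether its query came from an image or from text. Under the assumption that both encoders map into one joint embedding space, a model trained only with image queries can be queried with text at test time.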
