We propose Wav2CLIP, a robust audio representation learning method obtained by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods, as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open-sourced and made available for further applications.
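To illustrate how a shared audio-image-text embedding space enables zero-shot classification, the following is a minimal sketch: an audio embedding is compared against CLIP text embeddings of class prompts by cosine similarity, and the closest prompt is taken as the prediction. The embedding dimension, helper names, and random stand-in vectors are illustrative assumptions, not the released Wav2CLIP API.

```python
# Minimal sketch of zero-shot audio classification in a shared CLIP-style
# embedding space. The embeddings below are random placeholders; in practice
# they would come from the Wav2CLIP audio encoder and the CLIP text encoder.
import numpy as np

def cosine_similarity(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def zero_shot_classify(audio_emb: np.ndarray,
                       text_embs: np.ndarray,
                       labels: list[str]) -> str:
    """Return the label whose text embedding is closest to the audio embedding."""
    scores = cosine_similarity(audio_emb, text_embs)
    return labels[int(np.argmax(scores))]

# Example usage with hypothetical 512-dimensional embeddings.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)          # embedding of one audio clip
text_embs = rng.normal(size=(3, 512))     # one row per class prompt
labels = ["dog bark", "siren", "acoustic guitar"]
print(zero_shot_classify(audio_emb, text_embs, labels))
```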