Mainstream Audio Analytics models are trained under the paradigm of one class label for many recordings, with each model focused on a single task. Learning under such restricted supervision limits model flexibility: the models require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP); it learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio-text pairs and evaluated it on 16 downstream tasks across 8 domains, such as Sound Event Classification, music tasks, and speech-related tasks. Although CLAP was trained with significantly fewer pairs than comparable computer vision models, it establishes state-of-the-art (SoTA) Zero-Shot performance. Additionally, we evaluated CLAP in a supervised learning setup and achieved SoTA in 5 tasks. Hence, CLAP's Zero-Shot capability removes the need for training with class labels, enables flexible class prediction at inference time, and generalizes to multiple downstream tasks.
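To make the pretraining objective concrete, below is a minimal sketch of the symmetric contrastive loss used by CLIP-style dual-encoder models such as CLAP, operating on pre-computed audio and text embeddings. The function name, the temperature value, and the assumption that projections are already computed are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over pairwise audio-text similarities."""
    # L2-normalize so the dot product equals cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits: row i should match column i,
    # i.e., the i-th audio clip pairs with the i-th text description.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the audio-to-text and text-to-audio cross-entropy losses.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Example: a batch of 8 paired (audio, text) embeddings of dimension 512.
if __name__ == "__main__":
    a = torch.randn(8, 512)
    t = torch.randn(8, 512)
    print(clap_contrastive_loss(a, t).item())
```

Pulling matched pairs together and pushing mismatched pairs apart in the shared space is what later enables class prediction from text alone.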
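The Zero-Shot capability follows directly from the joint space: class labels are wrapped in a text prompt, embedded, and the predicted class is the one whose text embedding is most similar to the audio embedding. The sketch below assumes hypothetical `audio_encoder` and `text_encoder` callables that return embedding tensors; the prompt template is the kind used for sound classification (e.g., "This is a sound of [label]."), shown here as an assumption rather than the paper's verbatim template.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(audio_encoder, text_encoder, audio, class_names):
    """Predict a class for one audio clip using only class-name prompts."""
    # Illustrative prompt template; any natural-language description works.
    prompts = [f"This is a sound of {c}." for c in class_names]
    a = F.normalize(audio_encoder(audio), dim=-1)   # shape (1, d)
    t = F.normalize(text_encoder(prompts), dim=-1)  # shape (num_classes, d)
    scores = (a @ t.t()).squeeze(0)                 # cosine similarities
    return class_names[scores.argmax().item()]
```

Because the candidate classes are just text at inference time, the label set can be changed per query without retraining, which is the flexibility the abstract claims over fixed-category supervised models.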