Contrastive learning has shown remarkable success in multimodal representation learning. In this paper, we propose a contrastive language-audio pretraining pipeline that develops audio representations by combining audio data with natural language descriptions. To accomplish this goal, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs drawn from different data sources. Second, we construct a contrastive language-audio pretraining model, evaluating several audio encoders and text encoders. We incorporate a feature fusion mechanism and keyword-to-caption augmentation into the model design, enabling the model to process audio inputs of variable length and further improving performance. Third, we perform comprehensive experiments evaluating our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance on the text-to-audio retrieval task. On audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and obtains performance comparable to models trained in the non-zero-shot (supervised) setting. Both LAION-Audio-630K and the proposed model are publicly available.
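The core of a contrastive language-audio pretraining objective can be sketched as a CLIP-style symmetric cross-entropy over the cosine similarities of paired audio and text embeddings. The function below is a minimal NumPy illustration, not the paper's implementation; the function name and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss (illustrative sketch only).

    audio_emb, text_emb: (batch, dim) arrays where row i of each
    corresponds to the same audio-text pair.
    """
    # L2-normalize so dot products become cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = a @ t.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each audio embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is what makes the learned joint space usable for both retrieval and zero-shot classification.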