In the past, the rapidly evolving field of sound classification has greatly benefited from the application of methods from other domains. Today, we observe a trend toward fusing domain-specific tasks and approaches, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
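To make the zero-shot mechanism concrete, the sketch below illustrates the CLIP-style inference that the abstract refers to: class names are embedded as text prompts, audio is embedded into the same joint space, and the class whose text embedding has the highest cosine similarity to the audio embedding is predicted. This is a minimal NumPy illustration, not the paper's implementation; `embed_audio` and `embed_text` are hypothetical stand-ins (random projections) for the actual ESResNeXt audio encoder and CLIP text encoder, and the prompt template and embedding size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024  # assumed size of the joint embedding space

# Hypothetical stand-ins for the real encoders (ESResNeXt for audio,
# CLIP's text transformer for prompts). In AudioCLIP both modalities
# are mapped into a shared embedding space learned contrastively.
def embed_audio(waveform: np.ndarray) -> np.ndarray:
    return rng.standard_normal(DIM)

def embed_text(prompt: str) -> np.ndarray:
    return rng.standard_normal(DIM)

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Zero-shot classification: no training on the target dataset; the
# embedded class names themselves act as the classifier weights.
class_names = ["dog bark", "siren", "rain", "chainsaw"]
text_emb = np.stack(
    [l2_normalize(embed_text(f"a sound of {c}")) for c in class_names]
)

waveform = rng.standard_normal(44100)          # 1 s of dummy audio
audio_emb = l2_normalize(embed_audio(waveform))

logits = text_emb @ audio_emb                  # cosine similarities
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over classes
print(class_names[int(np.argmax(probs))])
```

Because the classifier is defined entirely by the text prompts, swapping in the label set of an unseen dataset (e.g. the 50 classes of ESC-50) requires no retraining, which is what enables the zero-shot baselines reported above.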