We present a multimodal framework for learning general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that the additional information contained in video can be used to greatly improve the learned features. First, we demonstrate that our contrastive framework does not require high-resolution images to learn good audio features. This allows us to scale up the training batch size while keeping the computational load incurred by the additional video modality at a reasonable level. Second, we use augmentations that mix together different samples. We show that these effectively make the proxy task harder, which leads to substantial performance improvements as the batch size increases. As a result, our audio model achieves a state-of-the-art 42.4 mAP on the AudioSet classification downstream task, closing the gap between supervised and self-supervised methods trained on the same dataset. Moreover, we show that our method is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and musical instrument classification.
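To make the sample-mixing augmentation concrete, the following is a minimal sketch of a mixup-style mixing step on batches of log-mel spectrograms. The function name `mix_samples` and the Beta-distributed mixing weight are illustrative assumptions, not the paper's exact recipe; the idea shown is only that each anchor is blended with another sample from the batch, which makes the contrastive proxy task of matching views harder.

```python
# Hypothetical mixup-style augmentation for contrastive audio training.
# Inputs are assumed to be log-mel spectrograms as NumPy arrays; the
# Beta(alpha, alpha) mixing weight follows the common mixup convention.
import numpy as np


def mix_samples(batch, alpha=0.4, rng=None):
    """Mix each spectrogram in `batch` with a randomly chosen partner.

    batch: array of shape (batch_size, mel_bins, time_frames).
    alpha: Beta(alpha, alpha) parameter controlling mixing strength.
    """
    rng = rng or np.random.default_rng()
    # One mixing coefficient per sample, broadcast over mel and time axes.
    lam = rng.beta(alpha, alpha, size=(batch.shape[0], 1, 1))
    perm = rng.permutation(batch.shape[0])
    # Convex combination of each sample with a permuted partner; the
    # anchor/positive pairing in the contrastive loss is left unchanged,
    # so the model must match views despite the injected distractor.
    return lam * batch + (1.0 - lam) * batch[perm]


# Usage: mix a batch of 8 spectrograms with 64 mel bins and 100 frames.
batch = np.random.randn(8, 64, 100).astype(np.float32)
mixed = mix_samples(batch)
print(mixed.shape)  # (8, 64, 100)
```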