Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.