Contrastive learning of auditory and visual perception has been highly successful when each modality is studied in isolation. However, major questions remain about how to integrate the principles learned from both domains to obtain effective audiovisual representations. In this paper, we present a contrastive framework for learning audiovisual representations from unlabeled videos. The type and strength of the augmentations applied during self-supervised pre-training play a crucial role in whether contrastive frameworks work well. Hence, we extensively investigate the composition of temporal augmentations suitable for learning audiovisual representations; we find that lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective. Furthermore, we show that the effectiveness of these transformations scales with higher temporal resolution and stronger transformation intensity. Compared to self-supervised models pre-trained with only sampling-based temporal augmentation, self-supervised models pre-trained with our temporal augmentations achieve an approximately 6.5% gain in linear classifier performance on the AVE dataset. Lastly, we show that despite their simplicity, our proposed transformations work well across self-supervised learning frameworks (SimSiam, MoCoV3, etc.) and on the benchmark audiovisual dataset (AVE).
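To make concrete what a "temporally coherent" lossy transformation means here, the following is a minimal sketch (in PyTorch; the function name, tensor layout, and parameter values are illustrative assumptions, not the authors' implementation): the random crop parameters are sampled once per clip and reused for every frame, so spatial content is perturbed while temporal order and motion cues are preserved.

```python
import torch

def coherent_crop_resize(clip: torch.Tensor, out_size: int = 112,
                         min_scale: float = 0.4) -> torch.Tensor:
    """clip: (T, C, H, W) video tensor with values in [0, 1]."""
    t, c, h, w = clip.shape
    # Sample ONE crop for the whole clip, so every frame is transformed
    # identically and temporal coherency is not corrupted.
    scale = min_scale + (1.0 - min_scale) * torch.rand(1).item()
    ch, cw = int(h * scale), int(w * scale)
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    cropped = clip[:, :, top:top + ch, left:left + cw]
    # Lossy resize back to the model's input resolution.
    return torch.nn.functional.interpolate(
        cropped, size=(out_size, out_size), mode="bilinear",
        align_corners=False)

# Two independently augmented views of the same clip form a positive pair
# for a contrastive or self-distillation objective (e.g. SimSiam, MoCoV3).
clip = torch.rand(16, 3, 224, 224)  # hypothetical 16-frame RGB clip
view_a = coherent_crop_resize(clip)
view_b = coherent_crop_resize(clip)
```

The transformation is "lossy" in that detail is discarded by cropping and down-sampling, but because the same parameters are shared across frames, the clip's temporal structure survives for the audiovisual encoder.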