In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning. In this paper, we generalize contrastive learning to a wider set of transformations, and their compositions, for which either invariance or distinctiveness is sought. We show that it is not immediately obvious how existing methods such as SimCLR can be extended to do so. Instead, we introduce a number of formal requirements that all contrastive formulations must satisfy, and propose a practical construction which satisfies these requirements. In order to maximise the reach of this analysis, we express all components of noise contrastive formulations as the choice of certain generalized transformations of the data (GDTs), including data sampling. We then consider videos as an example of data in which a large variety of transformations are applicable, accounting for the extra modalities -- for which we analyze audio and text -- and the dimension of time. We find that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state-of-the-art for multiple benchmarks by a large margin, and even surpassing supervised pretraining.
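As a rough illustration of the kind of objective involved (a generic InfoNCE-style contrastive loss written as a sketch, not the exact GDT formulation developed in the paper; the encoder $f$, similarity $s$, temperature $\tau$, and the transformation sets are placeholder notation introduced here), invariance is encouraged by treating differently transformed views of the same sample as positives, while transformations for which distinctiveness is sought contribute to the negatives:
\[
\mathcal{L}(x) \;=\; -\log
\frac{\exp\!\big(s(f(T_1 x),\, f(T_2 x)) / \tau\big)}
{\exp\!\big(s(f(T_1 x),\, f(T_2 x)) / \tau\big) \;+\; \sum_{T' \in \mathcal{N}} \exp\!\big(s(f(T_1 x),\, f(T' x')) / \tau\big)}
\]
Here $T_1, T_2$ are transformations to which the representation should be invariant, and the negative set $\mathcal{N}$ contains transformed versions of other samples $x'$ as well as views of $x$ produced by transformations to which the representation should be distinctive.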