We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion whereby, in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations. Our experiments show that strong augmentations for both modalities, combined with the relaxation of cross-modal temporal synchronicity, optimize performance. To pretrain our proposed framework, we use three datasets of varying sizes: Kinetics-Sound, Kinetics-400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and retrieval. CrissCross achieves state-of-the-art performance on action recognition (UCF101 and HMDB51) and sound classification (ESC50). The code and pretrained models will be made publicly available.
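To make the notion of relaxed temporal synchronicity concrete, the following is a minimal sketch of how synchronous and asynchronous audio-visual clip pairs could be sampled from a single video. The function name and the parameters (clip_len, max_offset) are illustrative assumptions, not values or APIs from the paper; the actual CrissCross sampling strategy, augmentations, and objectives are defined in the method section.

```python
import random


def sample_av_pairs(video_duration, clip_len=2.0, max_offset=2.0):
    """Illustrative sampling of one synchronous and one asynchronous
    audio-visual pair from a video of length `video_duration` seconds.

    Note: clip_len and max_offset are hypothetical parameters chosen
    for illustration only.
    """
    # Pick a start time for the visual clip.
    v_start = random.uniform(0.0, video_duration - clip_len)

    # Synchronous pair: the audio window matches the visual window in time.
    sync_pair = {
        "video": (v_start, v_start + clip_len),
        "audio": (v_start, v_start + clip_len),
    }

    # Asynchronous pair: the audio window is shifted by a random offset,
    # relaxing the temporal synchronicity between the two modalities.
    offset = random.uniform(-max_offset, max_offset)
    a_start = min(max(v_start + offset, 0.0), video_duration - clip_len)
    async_pair = {
        "video": (v_start, v_start + clip_len),
        "audio": (a_start, a_start + clip_len),
    }

    return sync_pair, async_pair


# Example: a 10-second video yields one synchronous and one asynchronous pair.
print(sample_av_pairs(10.0))
```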