We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning the intra-modal and standard synchronous cross-modal relations, CrissCross also learns asynchronous cross-modal relationships. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations. Our experiments show that strong augmentations for both audio and visual modalities, together with relaxed cross-modal temporal synchronicity, optimize performance. To pretrain our proposed framework, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and retrieval. CrissCross achieves state-of-the-art performance on action recognition (UCF101 and HMDB51) and sound classification (ESC50 and DCASE). The code and pretrained models will be made publicly available.
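To make the idea of relaxed temporal synchronicity concrete, the following is a minimal, hypothetical PyTorch sketch of how intra-modal, synchronous cross-modal, and asynchronous cross-modal objectives might be combined. The toy encoders, shared predictor, and negative-cosine loss are illustrative placeholders under stated assumptions, not the paper's actual architecture or training recipe.

```python
import torch
import torch.nn.functional as F
from torch import nn


def neg_cosine(p, z):
    """Negative cosine similarity between predictions p and (detached) targets z."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()


class CrissCrossSketch(nn.Module):
    """Toy stand-in encoders/projectors; a real setup would use e.g. a 3D CNN for
    video and a 2D CNN on log-mel spectrograms for audio (assumed, not from the abstract)."""

    def __init__(self, dim=128):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.pred = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, v_t1, v_t2, a_t1, a_t2):
        # v_t1/a_t1 and v_t2/a_t2 are augmented visual/audio clips drawn from two
        # different timestamps t1 and t2 of the same sample.
        zv1, zv2 = self.video_enc(v_t1), self.video_enc(v_t2)
        za1, za2 = self.audio_enc(a_t1), self.audio_enc(a_t2)
        pv1, pa1, pa2 = self.pred(zv1), self.pred(za1), self.pred(za2)

        # Intra-modal: two clips of the same modality at different times.
        l_intra = neg_cosine(pv1, zv2) + neg_cosine(pa1, za2)
        # Synchronous cross-modal: audio and video taken at the same timestamp.
        l_sync = neg_cosine(pv1, za1) + neg_cosine(pa2, zv2)
        # Asynchronous ("criss-cross") cross-modal: audio and video from different
        # timestamps, i.e. the temporal synchronicity constraint is relaxed.
        l_async = neg_cosine(pv1, za2) + neg_cosine(pa1, zv2)

        return l_intra + l_sync + l_async


# Usage with dummy tensors (batch of 4): video clips [B, C, T, H, W], log-mel audio [B, 1, F, T].
model = CrissCrossSketch()
v1, v2 = torch.randn(4, 3, 8, 64, 64), torch.randn(4, 3, 8, 64, 64)
a1, a2 = torch.randn(4, 1, 80, 100), torch.randn(4, 1, 80, 100)
loss = model(v1, v2, a1, a2)
loss.backward()
```

The sketch only illustrates how the three relation types can share one training signal; loss symmetrization, per-modality predictors, and the exact similarity objective are design choices left to the full method description.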