We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion whereby, in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. Through in-depth studies, we show that relaxing the temporal synchronicity between the audio and visual modalities enables the network to learn strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on several downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, when pretrained on Kinetics-Sound, CrissCross outperforms fully-supervised pretraining. The code and pretrained models are available on the project website.
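To make the notion of synchronous versus asynchronous cross-modal learning concrete, the sketch below illustrates one way such objectives could be combined. It is a minimal, hypothetical illustration only: the abstract does not specify CrissCross's actual loss, so the InfoNCE-style formulation, the embedding names (v_t, a_t, v_shift, a_shift), and the equal weighting of terms are all assumptions, not the paper's method.

```python
# Illustrative sketch, not the CrissCross objective. Assumptions: per-clip
# embeddings v_t/a_t come from the same timestamp (synchronous pair), while
# v_shift/a_shift come from a temporally shifted segment of the same video
# (asynchronous pair); a standard InfoNCE contrastive loss is used throughout.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE between two batches of embeddings."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def crisscross_style_loss(v_t, a_t, v_shift, a_shift):
    """Combine intra-modal, synchronous, and asynchronous cross-modal terms."""
    intra = info_nce(v_t, v_shift) + info_nce(a_t, a_shift)    # intra-modal
    sync = info_nce(v_t, a_t)                                   # synchronous cross-modal
    asyn = info_nce(v_t, a_shift) + info_nce(a_t, v_shift)      # asynchronous cross-modal
    return intra + sync + asyn

# Toy usage: a batch of 8 clips with 128-d embeddings per modality.
B, D = 8, 128
loss = crisscross_style_loss(torch.randn(B, D), torch.randn(B, D),
                             torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```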