Multi-modal self-supervised learning from videos has been shown to improve models' performance on various downstream tasks. However, such self-supervised pre-training requires large batch sizes and substantial computational resources, partly because of the noise present in uncurated data. This noise is exacerbated by the prevalent coarse-grained training scheme, in which single vectors representing whole video clips or whole natural-language sentences are used to compute similarity. Such a scheme makes training noisy, since parts of a video clip may be entirely uncorrelated with the paired input from the other modality, such as a text description. In this paper, we propose a fine-grained multi-modal self-supervised training scheme that computes similarity between embeddings at a finer scale (such as individual feature-map embeddings and phrase embeddings) and uses attention mechanisms to down-weight noisy pairs in the loss function. We show that with the proposed pre-training scheme, we can train smaller models, with smaller batch sizes and far less computation, and still achieve downstream performance comparable to the state of the art on tasks including action recognition and text-image retrieval.
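To make the idea concrete, below is a minimal sketch of an attention-weighted, fine-grained contrastive objective of the kind described above. This is an illustrative assumption, not the paper's actual implementation: the function names, tensor shapes, and the temperature hyperparameter are all hypothetical, and the attention here is a simple softmax over token-pair similarities standing in for whatever weighting mechanism the paper uses.

```python
# Illustrative sketch only: names, shapes, and hyperparameters are assumptions,
# not the paper's implementation.
import torch
import torch.nn.functional as F

def fine_grained_similarity(video_tokens, text_tokens, temperature=0.07):
    """Aggregate token-level similarities with attention weights so that
    poorly matching (noisy) token pairs contribute less to the clip-level score.

    video_tokens: (B, Nv, D) per-feature-map embeddings for B clips
    text_tokens:  (B, Nt, D) per-phrase embeddings for B sentences
    returns:      (B, B) clip-sentence similarity matrix
    """
    v = F.normalize(video_tokens, dim=-1)   # (B, Nv, D)
    t = F.normalize(text_tokens, dim=-1)    # (B, Nt, D)
    # Pairwise token similarities for every (clip, sentence) pair: (B, B, Nv, Nt)
    sim = torch.einsum("bnd,cmd->bcnm", v, t)
    # Attention over token pairs: well-matched pairs get higher weight, so
    # uncorrelated (noisy) pairs are softly suppressed in the aggregate score.
    attn = torch.softmax(sim.flatten(2) / temperature, dim=-1)  # (B, B, Nv*Nt)
    return (attn * sim.flatten(2)).sum(dim=-1)                  # (B, B)

def contrastive_loss(video_tokens, text_tokens, temperature=0.07):
    """Symmetric InfoNCE over the attention-weighted fine-grained similarities."""
    logits = fine_grained_similarity(video_tokens, text_tokens) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random tensors standing in for encoder outputs:
loss = contrastive_loss(torch.randn(4, 16, 256), torch.randn(4, 8, 256))
```

Because the weighting is computed inside each clip-sentence pair rather than over the whole batch, a mismatched token pair lowers only its own contribution, which is one way such a scheme can tolerate smaller batch sizes than coarse-grained contrastive training.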