The remarkable success of deep learning across various domains relies on the availability of large-scale annotated datasets. However, reliance on human-generated annotations can lead to models with biased learning, poor domain generalization, and poor robustness. Obtaining annotations is also expensive and labor-intensive, which is especially challenging for videos. As an alternative, self-supervised learning provides a way to learn representations without annotations and has shown promise in both the image and video domains. Unlike the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for ideas unique to video that can advance self-supervised learning in the video and multimodal domains. In this survey, we review existing approaches to self-supervised learning with a focus on the video domain. We group these methods into three categories based on their learning objectives: pretext tasks, generative modeling, and contrastive learning. These approaches also differ in the modalities they use: video, video-audio, video-text, and video-audio-text. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and potential future directions in this area.