The remarkable success of deep learning across various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and labor-intensive, which is especially challenging for videos. Moreover, reliance on human-generated annotations can lead to models with biased learning, poor domain generalization, and limited robustness. As an alternative, self-supervised learning provides a way to learn representations without annotations and has shown promise in both the image and video domains. Unlike the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we review existing approaches to self-supervised learning with a focus on the video domain. We group these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, the downstream evaluation tasks, insights into the limitations of existing works, and potential future directions in this area.