Training deep neural networks to learn visual features from images or videos generally requires large-scale labeled data to reach good performance in computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the main components and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Quantitative performance comparisons of the reviewed methods on benchmark datasets are then summarized and discussed for both image and video feature learning. Finally, the paper concludes and lists a set of promising future directions for self-supervised visual feature learning.