Self-supervised methods have achieved remarkable success in transfer learning, often matching or exceeding the accuracy of supervised pre-training. Most prior work achieves this by increasing pre-training computation through complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related but orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for obtaining high accuracy on representative visual tasks? Given the availability of large datasets, this setting is often the more relevant one for academic and industry labs alike. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised). In a like-for-like fashion, we characterize their FLOP and CO$_2$ footprints relative to their accuracy when transferred to a canonical image segmentation task. Our analysis reveals strong disparities in the computational efficiency of pre-training methods and their dependence on dataset quality. In particular, our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data. We therefore advocate for (1) paying closer attention to dataset curation and (2) reporting accuracies in the context of total computational cost.
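To make the footprint accounting concrete, below is a minimal sketch (not the paper's actual pipeline) of the common energy-based CO$_2$ estimate, which multiplies accelerator energy use by datacenter overhead and grid carbon intensity, and of a compute-normalized accuracy for like-for-like comparison. All names, numeric constants, and example values are illustrative assumptions, not measurements from this work.

```python
# Sketch of energy-based CO2 accounting and compute-normalized accuracy.
# All values below are illustrative placeholders, not results from the paper.

from dataclasses import dataclass


@dataclass
class PretrainingRun:
    name: str            # e.g. "SimCLR / ImageNet-1K" (hypothetical label)
    total_flops: float   # total pre-training FLOPs
    gpu_hours: float     # accelerator wall-clock hours
    accuracy: float      # downstream segmentation accuracy (e.g. mIoU)


def co2_kg(run: PretrainingRun,
           gpu_power_kw: float = 0.3,      # assumed avg. accelerator draw (kW)
           pue: float = 1.1,               # assumed datacenter PUE
           kg_co2_per_kwh: float = 0.4):   # assumed grid carbon intensity
    """CO2 estimate: energy consumed (kWh, incl. PUE) times grid intensity."""
    kwh = run.gpu_hours * gpu_power_kw * pue
    return kwh * kg_co2_per_kwh


def accuracy_per_exaflop(run: PretrainingRun) -> float:
    """Accuracy normalized by pre-training compute (per 1e18 FLOPs)."""
    return run.accuracy / (run.total_flops / 1e18)


if __name__ == "__main__":
    runs = [
        PretrainingRun("method A", total_flops=5e20, gpu_hours=2000, accuracy=0.46),
        PretrainingRun("method B", total_flops=1e21, gpu_hours=4500, accuracy=0.48),
    ]
    for r in runs:
        print(f"{r.name}: {co2_kg(r):.0f} kg CO2, "
              f"{accuracy_per_exaflop(r):.3f} acc/EFLOP")
```

Reporting a compute-normalized metric such as `accuracy_per_exaflop` alongside raw accuracy is one simple way to follow the paper's recommendation of stating results in the context of total computational cost.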