Transfer learning has witnessed remarkable progress in recent years, for example, with the introduction of augmentation-based contrastive self-supervised learning methods. While a number of large-scale empirical studies on the transfer performance of such models have been conducted, there is not yet an agreed-upon set of control baselines, evaluation practices, and metrics to report, which often hinders a nuanced and calibrated understanding of the real efficacy of these methods. We share an evaluation standard that aims to quantify and communicate transfer learning performance in an informative and accessible setup. This is done by baking a number of simple yet critical control baselines into the evaluation method, in particular the blind-guess baseline (quantifying the dataset bias), the scratch-model baseline (quantifying the architectural contribution), and the maximal-supervision baseline (quantifying the performance upper bound). To demonstrate how the evaluation standard can be employed, we provide an example empirical study investigating a few basic questions about self-supervised learning. For example, using this standard, the study shows that the effectiveness of existing self-supervised pre-training methods is skewed towards image classification tasks as opposed to dense pixel-wise prediction tasks. In general, we encourage using and reporting the suggested control baselines when evaluating transfer learning in order to gain a more meaningful and informative understanding.
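As a minimal illustration of the blind-guess control baseline described above, the sketch below (with hypothetical helper names; labels are assumed to be plain Python lists) measures the accuracy of a predictor that ignores the input entirely and always outputs the most frequent training label. This quantifies the dataset bias that any model obtains "for free", and it is the floor against which a transfer-learning result should be read.

```python
from collections import Counter

def blind_guess_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label.

    This hypothetical helper sketches the blind-guess control baseline:
    it uses no input features at all, so its score reflects only the
    label distribution (i.e., the dataset bias).
    """
    most_common = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == most_common)
    return correct / len(test_labels)

# On a skewed label distribution, the blind guess can look deceptively strong:
train = ["cat"] * 80 + ["dog"] * 20
test = ["cat"] * 8 + ["dog"] * 2
print(blind_guess_accuracy(train, test))  # 0.8
```

A transfer result only becomes meaningful relative to such controls: a fine-tuned model scoring 0.82 on this toy split would barely beat the input-free guess.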