Recent developments in large-scale machine learning suggest that by scaling up data, model size, and training time properly, one might observe that improvements in pre-training transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, performance on downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers, and ResNets with parameter counts ranging from ten million to ten billion, trained on the largest-scale available image datasets (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects this saturation and captures the nonlinear relationship between upstream and downstream performance. Delving deeper into the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way representations evolve through the layers of the models. We also showcase an even more extreme scenario in which upstream and downstream performance are at odds with each other: to achieve better downstream performance, we need to hurt upstream accuracy.
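To make the saturation idea concrete, the sketch below fits a saturating curve relating downstream error to upstream error. The functional form (a power law with an irreducible downstream error floor) and the data are illustrative assumptions for exposition, not the paper's actual model or experimental results.

```python
# Minimal sketch (assumed functional form, not the paper's exact model):
# fit e_DS ≈ e_inf + k * e_US**alpha, where a nonzero floor e_inf
# produces the saturation of downstream performance.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(e_us, e_inf, k, alpha):
    # Downstream error decays toward a floor e_inf as upstream error shrinks,
    # rather than going to zero.
    return e_inf + k * e_us ** alpha

# Synthetic (upstream error, downstream error) pairs standing in for
# per-model results; not real experimental data.
rng = np.random.default_rng(0)
e_us = np.linspace(0.05, 0.40, 20)
e_ds = 0.12 + 0.8 * e_us ** 1.5 + rng.normal(0, 0.005, size=e_us.shape)

params, _ = curve_fit(saturating_power_law, e_us, e_ds, p0=[0.1, 1.0, 1.0])
e_inf, k, alpha = params
print(f"fitted floor e_inf={e_inf:.3f}, k={k:.3f}, alpha={alpha:.3f}")
# A clearly positive e_inf indicates diminishing downstream returns from
# further upstream improvements, i.e. the saturation behavior described above.
```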