A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.
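To make the idea concrete, below is a minimal PyTorch sketch of surgical fine-tuning: every parameter of a pre-trained network is frozen, and only one chosen block is unfrozen before training on the target data. The model (ResNet-50), the block boundaries, and the optimizer settings are illustrative assumptions for exposition, not the paper's exact configuration; unfreezing the first block corresponds to the case the abstract highlights for image corruptions.

```python
# Minimal sketch of surgical fine-tuning (assumed setup, not the paper's code):
# freeze the whole pre-trained model, then unfreeze only one block.
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze all pre-trained parameters.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the first block ("surgical" choice for input-level shifts
# such as image corruptions); other shift types may favor later blocks.
for module in (model.conv1, model.bn1, model.layer1):
    for param in module.parameters():
        param.requires_grad = True

# Optimize only the unfrozen subset on the small target dataset.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    momentum=0.9,
)
```

Which block to unfreeze depends on the shift type; the same pattern applies by swapping in middle or last blocks of the network.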