When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
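The LP-FT recipe described above is a simple two-step procedure, and a minimal sketch may help make it concrete. The following is an illustrative PyTorch implementation under assumed choices (a torchvision ResNet-50 backbone, SGD, and synthetic placeholder data); the learning rates, epoch counts, and dataset here are hypothetical and not the authors' settings.

```python
# Minimal sketch of LP-FT (linear probing, then full fine-tuning).
# All hyperparameters and the data below are illustrative placeholders.
import torch
import torch.nn as nn
import torchvision.models as models


def train(model, loader, params, lr, epochs):
    """Standard supervised training loop on the given parameter set."""
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()


# Pretrained backbone with a fresh linear head for the downstream task.
num_classes = 17  # hypothetical, e.g. a Living-17-sized label set
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Placeholder downstream training data (replace with a real DataLoader).
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,))
    ),
    batch_size=4,
)

# Step 1: linear probing -- freeze the backbone, train only the head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc.")
train(model, train_loader, model.fc.parameters(), lr=1e-2, epochs=10)

# Step 2: full fine-tuning initialized from the probed head, typically
# with a smaller learning rate; since the head is already near-optimal,
# the lower layers change less and the pretrained features are distorted less.
for p in model.parameters():
    p.requires_grad = True
train(model, train_loader, model.parameters(), lr=1e-4, epochs=10)
```

The key design point is the initialization of step 2: starting full fine-tuning from a good head (rather than a random or fixed one) is what, per the analysis above, reduces feature distortion and combines the ID gains of fine-tuning with the OOD robustness of linear probing.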