Fine-tuning large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identifying the best models to transfer-learn from, and quantifying transferability prevents expensive re-training on all candidate model/task pairs. We show that statistical problems with covariance estimation drive the poor performance of H-score [Bao et al., 2019] -- a common baseline for newer metrics -- and propose a shrinkage-based estimator. This results in up to an 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure of You et al. [2021], while being 3-55 times faster to compute. Additionally, we look into the less common setting of target (as opposed to source) task selection. We identify previously overlooked problems in such settings, arising from differing numbers of labels, class-imbalance ratios, etc., that caused some recent metrics, e.g., LEEP [Nguyen et al., 2020], to be misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We also outline the difficulties of comparing feature-dependent metrics, both supervised (e.g., H-score) and unsupervised (e.g., Maximum Mean Discrepancy [Long et al., 2015]), across source models and layers with different feature embedding dimensions. We show that dimensionality reduction methods allow for meaningful comparison across models and improve the performance of some of these measures. We investigate the performance of 14 different supervised and unsupervised metrics and demonstrate that even unsupervised metrics can identify the leading models for domain adaptation. We support our findings with ~65,000 fine-tuning experiments.
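To make the shrinkage idea concrete, below is a minimal sketch of a shrinkage-based H-score. The abstract does not specify which shrinkage estimator is used; this sketch assumes a Ledoit-Wolf shrinkage estimate of the feature covariance (via scikit-learn) as one plausible instantiation, with the standard H-score definition H(f) = tr(cov(f)^-1 cov(E[f|y])) from Bao et al. [2019]. The function name `shrinkage_h_score` is illustrative, not from the paper.

```python
# Sketch: H-score with a shrinkage-based feature-covariance estimate.
# Assumes Ledoit-Wolf shrinkage; the paper's exact estimator may differ.
import numpy as np
from sklearn.covariance import LedoitWolf


def shrinkage_h_score(features: np.ndarray, labels: np.ndarray) -> float:
    """Approximate H(f) = tr(cov(f)^-1 @ cov(E[f|y])) with a shrunk cov(f).

    features: (n_samples, dim) embeddings from a candidate source model/layer.
    labels:   (n_samples,) target-task class labels.
    """
    # Center the features.
    features = features - features.mean(axis=0, keepdims=True)

    # Shrinkage-based estimate of the feature covariance (Ledoit-Wolf),
    # which regularizes the ill-conditioned empirical covariance.
    cov_f = LedoitWolf().fit(features).covariance_

    # Inter-class covariance: covariance of the class-conditional means,
    # replicated per sample so classes are weighted by their frequency.
    cond_means = np.zeros_like(features)
    for c in np.unique(labels):
        idx = labels == c
        cond_means[idx] = features[idx].mean(axis=0)
    cov_zf = np.cov(cond_means, rowvar=False)

    return float(np.trace(np.linalg.pinv(cov_f) @ cov_zf))
```

In a source-model selection setting, one would compute this score from each candidate model's (e.g., penultimate-layer) embeddings of the target data and rank candidates by the score, fine-tuning only the top-ranked model(s).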