Model hubs with many pre-trained models (PTMs) have become a cornerstone of deep learning. Although built at high cost, they remain \emph{under-exploited}: practitioners usually pick one PTM from the provided model hub by popularity and then fine-tune that PTM to solve the target task. This na\"ive but common practice poses two obstacles to fully exploiting pre-trained model hubs: first, PTM selection by popularity has no optimality guarantee; second, only one PTM is used while the remaining PTMs are ignored. An alternative might be to consider all possible combinations of PTMs and extensively fine-tune each combination, but this would not only be computationally prohibitive but might also lead to statistical over-fitting. In this paper, we propose a new paradigm for exploiting model hubs that is intermediate between these extremes. The paradigm is characterized by two aspects: (1) We use an evidence maximization procedure to estimate the maximum value of label evidence given features extracted by pre-trained models. This procedure can rank all the PTMs in a model hub, for various types of PTMs and tasks, \emph{before fine-tuning}. (2) The best-ranked PTM can either be fine-tuned and deployed, if we have no preference for the model's architecture, or the target PTM can be tuned by the top-$K$ ranked PTMs via a Bayesian procedure that we propose. This procedure, which we refer to as \emph{B-Tuning}, not only improves upon specialized methods designed for tuning homogeneous PTMs, but also applies to the challenging problem of tuning heterogeneous PTMs, where it yields a new level of benchmark performance.
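To make the ranking idea concrete, the sketch below scores a set of fixed feature extractors by the log marginal likelihood (evidence) of the target labels under a Bayesian linear model on each extractor's features, maximized over the prior precision $\alpha$ and noise precision $\beta$ by fixed-point iteration. This is an illustrative implementation of evidence maximization for PTM ranking, not the paper's exact algorithm; the function names `log_evidence` and `rank_ptms` are ours.

```python
import numpy as np

def log_evidence(features, labels, n_iter=100, tol=1e-6):
    """Log evidence of `labels` under a Bayesian linear model on fixed
    `features`, maximized over prior precision alpha and noise precision
    beta via fixed-point updates. Illustrative sketch only."""
    F = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float).ravel()
    n, d = F.shape
    # One-off economy SVD; every iteration afterwards is cheap.
    u, s, _ = np.linalg.svd(F, full_matrices=False)
    s2 = s ** 2
    uty = u.T @ y                 # label projections onto left singular vectors
    y2 = np.sum(y ** 2)
    alpha, beta = 1.0, 1.0
    for _ in range(n_iter):
        # Posterior mean in the singular basis, effective dof gamma, residual.
        m = beta * s * uty / (alpha + beta * s2)
        m2 = np.sum(m ** 2) + 1e-12
        res = y2 - np.sum(uty ** 2) + np.sum((uty - s * m) ** 2) + 1e-12
        gamma = np.sum(beta * s2 / (alpha + beta * s2))
        alpha_new = gamma / m2
        beta_new = (n - gamma) / res
        if abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol:
            alpha, beta = alpha_new, beta_new
            break
        alpha, beta = alpha_new, beta_new
    # Evidence at the optimized (alpha, beta); |A| handles rank-deficient F.
    m = beta * s * uty / (alpha + beta * s2)
    m2 = np.sum(m ** 2)
    res = y2 - np.sum(uty ** 2) + np.sum((uty - s * m) ** 2)
    logdetA = np.sum(np.log(alpha + beta * s2)) + (d - len(s2)) * np.log(alpha)
    return 0.5 * (d * np.log(alpha) + n * np.log(beta)
                  - beta * res - alpha * m2 - logdetA - n * np.log(2 * np.pi))

def rank_ptms(feature_dict, labels):
    """Rank candidate PTMs (name -> extracted feature matrix) by evidence,
    best first, without fine-tuning any of them."""
    scores = {name: log_evidence(F, labels) for name, F in feature_dict.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In use, one would extract features on the target data with each candidate PTM, call `rank_ptms`, and then fine-tune only the top-ranked model (or feed the top-$K$ into a tuning procedure such as B-Tuning).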