The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
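To make the core idea concrete, below is a minimal sketch of the simplest variant, a "uniform soup," which element-wise averages the weights of several fine-tuned checkpoints that share the same architecture. It assumes PyTorch-style state dicts; the function name, checkpoint paths, and model constructor are illustrative placeholders, not the authors' released code (see the repository linked above for the official implementation).

```python
import torch

def uniform_soup(checkpoint_paths, model):
    """Average the parameters of identically-architected fine-tuned checkpoints.

    checkpoint_paths: list of paths to saved state dicts (hypothetical).
    model: an instance of the shared architecture to load the soup into.
    """
    soup_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup_state is None:
            # Initialize the running sum with a copy of the first checkpoint.
            soup_state = {k: v.clone().float() for k, v in state.items()}
        else:
            # Accumulate parameters element-wise.
            for k, v in state.items():
                soup_state[k] += v.float()
    # Divide by the number of models to obtain the average.
    num_models = len(checkpoint_paths)
    soup_state = {k: v / num_models for k, v in soup_state.items()}
    model.load_state_dict(soup_state)
    return model
```

Because the result is a single set of weights, inference cost is identical to that of one fine-tuned model, in contrast to a logit ensemble, which must run every member at test time.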