Despite being able to capture a range of features of the data, high-accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training methodology, which would limit ensembling benefits and leave low-accuracy models with little practical use. Against this backdrop, recent work has shown that very different training techniques, such as large-scale contrastive learning, can yield competitively high accuracy on generalization and robustness benchmarks. This motivates us to revisit the assumption that models necessarily learn similar functions. We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show that these models specialize in subdomains of the data, leading to higher ensemble performance: with just two models (each with ImageNet accuracy of ~76.5%), we can create ensembles that reach 83.4% (a +7% boost). Surprisingly, we find that even significantly lower-accuracy models can be used to improve high-accuracy models. Finally, we show that diverging training methodologies yield representations that capture overlapping (but not supersetting) feature sets which, when combined, lead to increased downstream performance.
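To make the ensembling claim concrete, the sketch below shows the standard way two independently trained classifiers can be combined by averaging their softmax probabilities. The specific torchvision models used here are illustrative stand-ins, not the model pairs studied in the paper, which combines models trained with more divergent methodologies (e.g., supervised vs. contrastive).

```python
# Minimal sketch: two-model ensemble via probability averaging.
# Model choices are hypothetical examples, not the paper's exact pairs.
import torch
import torchvision.models as models

model_a = models.resnet50(weights="IMAGENET1K_V1").eval()      # one architecture
model_b = models.densenet201(weights="IMAGENET1K_V1").eval()   # a different architecture

@torch.no_grad()
def ensemble_predict(images: torch.Tensor) -> torch.Tensor:
    """Average class probabilities from both models and return predicted labels."""
    probs_a = torch.softmax(model_a(images), dim=-1)
    probs_b = torch.softmax(model_b(images), dim=-1)
    return (probs_a + probs_b).argmax(dim=-1)
```

If the two members make uncorrelated errors, the averaged prediction can exceed either member's accuracy, which is the effect the reported 76.5% -> 83.4% gain illustrates.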