Within the machine learning community, the widely used uniform convergence framework has been employed to answer the question of how complex, over-parameterized models can generalize well to new data. This approach bounds the test error of the worst-case model one could have fit to the data, but it has fundamental limitations. Inspired by the statistical mechanics approach to learning, we formally define and develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers from several model classes. We apply our method to compute this distribution for several real and synthetic datasets, with both linear and random feature classification models. We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model on the same datasets, indicating that "bad" classifiers are extremely rare. We provide theoretical results in a simple setting in which we characterize the full asymptotic distribution of test errors, and we show that these errors indeed concentrate around a value $\varepsilon^*$, which we also identify exactly. We then formalize a more general conjecture supported by our empirical findings. Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice, and that approaches based on the statistical mechanics of learning may offer a promising alternative.
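To make the idea concrete, the following is a minimal, self-contained sketch (not the paper's exact construction) of what "computing the distribution of test errors among interpolating classifiers" can look like for a linear model on synthetic data. The teacher direction `w_star`, all problem sizes, and the sampling scale `sigma` over the null space of the training data are illustrative assumptions; the paper's actual measure over interpolators and its datasets differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n training points in d > n dimensions, so that
# exact interpolation of the training labels is possible.
n, d, n_test, n_models = 50, 200, 2000, 1000

w_star = rng.standard_normal(d) / np.sqrt(d)    # assumed "teacher" direction
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)                         # +/-1 training labels
X_test = rng.standard_normal((n_test, d))
y_test = np.sign(X_test @ w_star)

# With d > n and X of full row rank, every w = w0 + (null-space component)
# satisfies X @ w = y exactly, and therefore interpolates the training set.
w0 = np.linalg.pinv(X) @ y                      # minimum-norm interpolator
_, _, Vt = np.linalg.svd(X, full_matrices=True)
N = Vt[n:].T                                    # orthonormal basis of null(X)

# Sampling scale for the null-space component (an arbitrary choice here;
# the measure over interpolators is a modeling assumption, not the paper's).
sigma = np.linalg.norm(w0) / np.sqrt(d - n)

errors = np.empty(n_models)
for i in range(n_models):
    w = w0 + sigma * (N @ rng.standard_normal(d - n))   # random interpolator
    errors[i] = np.mean(np.sign(X_test @ w) != y_test)  # its test error

# Inspect the empirical distribution: how tightly do the errors concentrate
# around a typical value, and how far is the worst sampled model from it?
print(f"median test error:   {np.median(errors):.3f}")
print(f"5th-95th percentile: {np.percentile(errors, [5, 95])}")
print(f"worst sampled:       {errors.max():.3f}")
```

Because $Xw = y$ holds exactly for every sampled $w$, each draw is an interpolator by construction, and the printed percentiles give a crude empirical picture of concentration of test errors around a typical value $\varepsilon^*$, as opposed to the worst case among the samples.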