Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics

To better understand the causes of good generalization performance in state-of-the-art neural network (NN) models, we analyze a corpus of models that was made publicly available for a contest to predict the generalization accuracy of NNs. These models span a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We identify what amounts to a Simpson's paradox: "scale" metrics (from traditional statistical learning theory) perform well overall but perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; whereas "shape" metrics (from Heavy-Tailed Self Regularization theory) perform well on subpartitions of the data, when hyperparameters are varied for models of a given depth, but perform poorly overall when models with varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied, as well as the complementary role of implicit scale versus implicit shape parameters in understanding NN model quality. Our results also suggest caution when one tries to extract causal insight with a single metric applied to aggregate data, and they highlight the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of state-of-the-art NN models. Based on these findings, we present two novel shape metrics, one data-independent and the other data-dependent, which can predict trends in the test accuracy of a series of NNs of a fixed architecture/depth when solver hyperparameters are varied.
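The aggregation effect described above can be made concrete with a toy sketch. The numbers below are entirely synthetic and are not taken from the contest corpus; the `metric` column is a hypothetical stand-in for a scale-like quantity, and Pearson correlation is used only for simplicity, not because the contest scored submissions that way. The sketch shows a metric that tracks accuracy well when all depths are pooled, yet trends the wrong way inside every fixed-depth group.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic (depth, metric, accuracy) records. ASSUMPTION: made-up values.
# Deeper models get both a larger metric and a higher accuracy, but within
# a fixed depth, raising the metric (via `step`, a stand-in for a
# regularization hyperparameter) lowers accuracy.
records = [
    (depth, 10.0 * depth + step, 0.70 + 0.08 * depth - 0.01 * step)
    for depth in range(3)
    for step in range(5)
]

# Correlation over the pooled data (all depths aggregated).
overall = pearson(
    [metric for _, metric, _ in records],
    [acc for _, _, acc in records],
)

# Correlation inside each fixed-depth subpartition.
within = {
    depth: pearson(
        [metric for d, metric, _ in records if d == depth],
        [acc for d, _, acc in records if d == depth],
    )
    for depth in range(3)
}

print(f"pooled correlation: {overall:+.3f}")
for depth, r in within.items():
    print(f"depth {depth} correlation: {r:+.3f}")
```

The pooled correlation comes out positive while each per-depth correlation is negative, which is the sign reversal that defines a Simpson's paradox. Swapping the roles (a metric that works within each depth but fails when pooled) would mimic the behavior the abstract attributes to shape metrics.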
