We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set but with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data. This builds on -- and is a stronger version of -- the observation in Nakkiran & Bansal '20, which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the \emph{well-calibrated} nature of \emph{ensembles} of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.
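The disagreement measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `preds_a` and `preds_b` stand in for the class predictions of two independently SGD-trained copies of the same architecture on unlabeled test inputs, and the toy prediction lists below are made up for demonstration.

```python
# Hedged sketch: estimating test error from the disagreement of two
# independent SGD runs of the same architecture on the same training set.
# The predictions below are illustrative stand-ins, not real model outputs.

def disagreement_rate(preds_a, preds_b):
    """Fraction of unlabeled test points on which the two runs disagree.

    The paper's empirical claim is that this rate closely tracks the
    test error of either individual run.
    """
    assert len(preds_a) == len(preds_b), "both runs must score the same inputs"
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Toy example: predicted class labels from two hypothetical SGD runs.
run1 = [0, 1, 1, 2, 0, 1, 2, 2]
run2 = [0, 1, 2, 2, 0, 1, 1, 2]
print(disagreement_rate(run1, run2))  # disagree on 2 of 8 points -> 0.25
```

In practice the two prediction lists would come from evaluating two separately trained networks on the same pool of unlabeled test data; no test labels are needed at any point.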