Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence. While it is well-known that many of these existing bounds are numerically large, through numerous experiments, we bring to light a more concerning aspect of these bounds: in practice, these bounds can {\em increase} with the training dataset size. Guided by our observations, we then present examples of overparameterized linear classifiers and neural networks trained by gradient descent (GD) where uniform convergence provably cannot ``explain generalization'' -- even if we take into account the implicit bias of GD {\em to the fullest extent possible}. More precisely, even if we consider only the set of classifiers output by GD, which have test errors less than some small $\epsilon$ in our settings, we show that applying (two-sided) uniform convergence on this set of classifiers will yield only a vacuous generalization guarantee larger than $1-\epsilon$. Through these findings, we cast doubt on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.
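To make the final quantitative claim concrete, the display below sketches the quantity being bounded; the notation ($\mathcal{H}_{\mathrm{GD}}$ for the set of classifiers output by GD, $\mathcal{L}_{\mathcal{D}}$ for test error, $\widehat{\mathcal{L}}_S$ for empirical error on a training set $S$ of size $m$) is introduced here only for illustration and is not fixed by the abstract itself. The claim is that, with high probability over the draw of $S$,
\[
  \sup_{h \in \mathcal{H}_{\mathrm{GD}}} \bigl|\mathcal{L}_{\mathcal{D}}(h) - \widehat{\mathcal{L}}_S(h)\bigr| \;\ge\; 1 - \epsilon,
  \qquad S \sim \mathcal{D}^m,
\]
so even the tightest (two-sided) uniform convergence guarantee, of the form $\mathcal{L}_{\mathcal{D}}(h) \le \widehat{\mathcal{L}}_S(h) + \sup_{h' \in \mathcal{H}_{\mathrm{GD}}} |\mathcal{L}_{\mathcal{D}}(h') - \widehat{\mathcal{L}}_S(h')|$, exceeds $1-\epsilon$ and is vacuous, despite every $h \in \mathcal{H}_{\mathrm{GD}}$ having test error at most $\epsilon$.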