One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories -- such as those based on the VC dimension of the class of predictors induced by modern neural network architectures -- are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically. Jiang et al. (2020) recently described a large-scale empirical study aimed at uncovering potential causal relationships between bounds/measures and generalization. Building on their study, we highlight where their proposed methods can obscure failures and successes of generalization measures in explaining generalization. We argue that generalization measures should instead be evaluated within the framework of distributional robustness.
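To make the closing claim concrete, here is a minimal sketch of the distributional-robustness style of evaluation the abstract advocates: instead of averaging a measure's success across training environments, report its worst case over environments. The sign-agreement criterion, the environments, and all data below are hypothetical illustrations, not the paper's actual protocol.

```python
# Hypothetical sketch: average vs. worst-case (robust) evaluation of a
# generalization measure. For pairs of models within each environment,
# the measure "succeeds" when the sign of its change agrees with the
# sign of the change in the true generalization gap.

def robust_sign_agreement(measure_deltas, gap_deltas):
    """Return (average, robust) agreement rates over environments.

    measure_deltas / gap_deltas: one list per environment of paired
    differences (measure difference, true-gap difference) between models.
    """
    per_env = []
    for m_deltas, g_deltas in zip(measure_deltas, gap_deltas):
        agree = [(m > 0) == (g > 0) for m, g in zip(m_deltas, g_deltas)]
        per_env.append(sum(agree) / len(agree))
    average = sum(per_env) / len(per_env)  # average-case evaluation
    robust = min(per_env)                  # worst environment dominates
    return average, robust

# Three hypothetical "environments" (e.g. hyperparameter slices):
measure = [[0.2, 0.1, -0.3], [0.5, 0.4, 0.1], [-0.2, -0.1, 0.3]]
gaps    = [[0.1, 0.2, -0.1], [0.3, 0.2, 0.2], [ 0.2,  0.1, 0.3]]
avg, rob = robust_sign_agreement(measure, gaps)
```

In this toy setup the measure tracks the gap perfectly in two environments but fails in the third, so the average looks strong while the robust score exposes the failure, which is precisely the kind of obscured failure mode the abstract describes.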