A common strategy for training deep neural networks (DNNs) is to use very large architectures and to train them until they (almost) achieve zero training error. The empirically observed good generalization performance on test data, even in the presence of substantial label noise, corroborates this procedure. On the other hand, statistical learning theory tells us that over-fitting can lead to poor generalization, for example in empirical risk minimization (ERM) over excessively large hypothesis classes. Inspired by this contradictory behavior, so-called interpolation methods have recently received much attention, leading to consistent and even optimal learning methods for some local averaging schemes with zero training error. So far, however, there has been no theoretical analysis of interpolating ERM-like methods. We take a step in this direction by showing that for certain large hypothesis classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense. Moreover, we show that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
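To make the interpolation regime concrete, the following minimal NumPy sketch, an illustration rather than a construction from the paper, trains an over-parameterized two-layer ReLU network by full-batch gradient descent on a small noisy regression sample until it (almost) interpolates the training labels. All architecture and hyperparameter choices (width, learning rate, noise level) are illustrative assumptions.

```python
# Minimal sketch of the interpolation regime described above; all
# hyperparameters are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

# Small, noisy regression sample: y = sin(2*pi*x) + label noise.
n, width, lr, steps = 20, 2000, 0.5, 20000
x_train = rng.uniform(-1.0, 1.0, size=(n, 1))
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal((n, 1))

# Over-parameterized two-layer ReLU network: width >> n, so exact
# interpolation of the (noisy) training labels is achievable.
W1 = rng.standard_normal((1, width)) / np.sqrt(width)
b1 = np.zeros(width)
W2 = rng.standard_normal((width, 1)) / np.sqrt(width)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden activations
    return h, h @ W2                  # activations, prediction

# Full-batch gradient descent on the (half) mean squared error until
# the training error is (almost) zero, i.e. the network interpolates.
for step in range(steps):
    h, pred = forward(x_train)
    err = pred - y_train
    if np.mean(err ** 2) < 1e-6:
        break
    gW2 = h.T @ err / n               # gradient w.r.t. W2
    dh = (err @ W2.T) * (h > 0.0)     # backprop through the ReLU
    gW1 = x_train.T @ dh / n          # gradient w.r.t. W1
    gb1 = dh.mean(axis=0)             # gradient w.r.t. b1
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2

print("training MSE:", float(np.mean((forward(x_train)[1] - y_train) ** 2)))

# Test error against the noiseless target: despite fitting the noise
# exactly, the interpolant may still generalize reasonably well.
x_test = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y_clean = np.sin(2 * np.pi * x_test)
print("test MSE:", float(np.mean((forward(x_test)[1] - y_clean) ** 2)))
```

Making the width far larger than the sample size is what puts exact interpolation within reach of plain gradient descent; which of the many zero-training-error solutions the optimizer selects is precisely the kind of question that separates well-behaved interpolating ERMs from those that fail.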