The loss surface of an overparameterized neural network (NN) possesses many global minima with zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the initialization scale of a strongly overparameterized NN affects the minimizer and can degrade its final test performance, and we propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimizers such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case GD and stochastic GD result in essentially the same minimizer. Lastly, we show that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.
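The claim that GD and an adaptive optimizer reach different zero-training-error minimizers can be illustrated on a toy problem. The sketch below (a minimal assumption-laden stand-in, not the paper's neural-network setting) fits an overparameterized linear model with more parameters than samples, so infinitely many weight vectors interpolate the data; GD and AdaGrad, started from the same initialization, both drive the training residual to (near) zero yet end at different interpolators.

```python
import numpy as np

# Toy illustration (not the paper's setting): an overparameterized linear
# model with d parameters and only n < d samples has infinitely many
# zero-training-error solutions; which one the optimizer finds depends
# on the update rule, even from an identical initialization.
rng = np.random.default_rng(0)
n, d = 5, 50
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

def train(adaptive, steps=20000, lr=0.1, eps=1e-8):
    w = np.zeros(d)        # identical initialization for both optimizers
    g2 = np.zeros(d)       # AdaGrad's running sum of squared gradients
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        if adaptive:
            g2 += grad ** 2
            w -= lr * grad / (np.sqrt(g2) + eps)  # AdaGrad step
        else:
            w -= lr * grad                        # plain GD step
    return w

w_gd, w_ada = train(adaptive=False), train(adaptive=True)
print("GD train residual:     ", np.linalg.norm(X @ w_gd - y))
print("AdaGrad train residual:", np.linalg.norm(X @ w_ada - y))
print("minimizer gap ||w_gd - w_ada||:", np.linalg.norm(w_gd - w_ada))
```

Both residuals are essentially zero, while the gap between the two weight vectors is not: GD from zero initialization stays in the row space of `X` (the minimum-norm interpolator), whereas AdaGrad's per-coordinate rescaling steers it to a different global minimum.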