Although dropout has achieved great success in deep learning, little is known about how it helps training find a well-generalizing solution in the high-dimensional parameter space. In this work, we show that training with dropout finds a neural network at a flatter minimum than standard gradient descent training does. We further study, through experiments, the underlying mechanism by which dropout finds flatter minima. We propose a {\it Variance Principle}: the variance of the noise is larger along sharper directions of the loss landscape. Existing works show that SGD satisfies the variance principle, which leads the training to flatter minima. Our work shows that the noise induced by dropout also satisfies the variance principle, which explains why dropout finds flatter minima. In general, our work points out that the variance principle is an important similarity between dropout and SGD that leads training to find flatter minima and obtain good generalization.
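To make the Variance Principle concrete, one possible formalization (the notation here is illustrative, not necessarily the paper's exact definition) is in terms of the Hessian eigendirections of the loss at a minimum: let $H=\nabla^2 L(\theta^*)$ have eigenpairs $(\lambda_i, v_i)$ and let $\epsilon$ denote the training noise (induced by dropout or by SGD minibatching); the principle then states that the projected noise variance grows with sharpness,
\[
    \lambda_i > \lambda_j
    \;\Longrightarrow\;
    \operatorname{Var}\!\left(v_i^{\top}\epsilon\right)
    >
    \operatorname{Var}\!\left(v_j^{\top}\epsilon\right),
\]
so that the noise perturbs the iterate more strongly along sharp directions, biasing training toward flatter minima.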