It is important to understand how dropout, a popular regularization method, helps neural network training find solutions that generalize well. In this work, we present a theoretical derivation of an implicit regularization of dropout, which we validate through a series of experiments. In addition, we numerically study two implications of this implicit regularization that intuitively explain why dropout aids generalization. First, we find that when trained with dropout, the input weights of hidden neurons tend to condense onto isolated orientations. Condensation is a feature of the non-linear learning process that makes the network effectively less complex. Second, we experimentally find that training with dropout leads the network to a flatter minimum than standard gradient descent training, and that the implicit regularization is the key to finding such flat solutions. Although our theory mainly focuses on dropout applied to the last hidden layer, our experiments apply to general dropout in neural network training. This work identifies a characteristic of dropout distinct from stochastic gradient descent and serves as an important basis for fully understanding dropout.
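To make the setting concrete, the sketch below (not the authors' code; the layer widths, dropout rate, and the cosine-similarity probe are illustrative assumptions) builds a two-layer ReLU network in PyTorch with dropout applied to the last hidden layer, and shows one simple way to probe condensation: measuring the pairwise cosine similarity of the hidden neurons' input weight vectors, which concentrates near ±1 when the weights condense onto a few isolated orientations.

```python
# Minimal sketch, assuming PyTorch: a two-layer ReLU network with dropout
# on the last hidden layer, plus a simple probe of "condensation" via
# pairwise cosine similarity of the hidden neurons' input weights.
# Widths, dropout rate p, and the probe itself are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self, d_in=10, width=100, p=0.5):
        super().__init__()
        self.hidden = nn.Linear(d_in, width)  # rows are input weights of hidden neurons
        self.dropout = nn.Dropout(p)          # dropout on the last hidden layer
        self.out = nn.Linear(width, 1)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        h = self.dropout(h)                   # randomly zeroes activations in train mode
        return self.out(h)

def pairwise_cosine(weights):
    # Cosine similarity between input-weight vectors of all hidden neurons;
    # many pairs near +/-1 indicate condensation onto few orientations.
    w = F.normalize(weights, dim=1)
    return w @ w.t()

model = TwoLayerNet()
x, y = torch.randn(256, 10), torch.randn(256, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):                          # short illustrative training loop
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()

sim = pairwise_cosine(model.hidden.weight.detach())
print(sim.abs().mean())                       # higher mean |cos| suggests stronger condensation
```

Dropout is active only in `model.train()` mode and disabled in `model.eval()` mode, so the same probe can be run before and after training, with and without the dropout layer, to compare the resulting weight orientations.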