Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights as a scale g times a unit vector w, which makes the objective function non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights: for suitable stepsizes for g and w, they converge close to the minimum l2 norm solution even from initializations far from zero. This is different from the behavior of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization.
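To make the reparametrization concrete, the following is a minimal numerical sketch of gradient descent on the weight-normalized least-squares objective L(g, w) = 0.5 ||X (g w/||w||) - y||^2, compared against the minimum l2 norm interpolating solution. It is not the paper's algorithm or experiment: the problem sizes, stepsizes, iteration count, and random initialization are illustrative assumptions, the gradients simply follow from the chain rule through v = g w/||w||, and rPGD's projection step is not shown.

```python
import numpy as np

# Sketch: weight-normalized gradient descent on overparametrized least squares.
# All sizes and stepsizes below are illustrative, not the paper's settings.
rng = np.random.default_rng(0)
n, d = 20, 100                       # more features than samples (overparametrized)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

v_min_norm = np.linalg.pinv(X) @ y   # minimum l2-norm interpolating solution

w = rng.standard_normal(d)           # initialization far from zero
g = 1.0
lr_w, lr_g = 2e-3, 2e-3              # stepsizes for w and g (illustrative)

for _ in range(30000):
    norm_w = np.linalg.norm(w)
    v = g * w / norm_w               # effective weight vector v = g * w / ||w||
    grad_v = X.T @ (X @ v - y)       # gradient of the least-squares loss w.r.t. v
    # Chain rule through the reparametrization:
    grad_g = grad_v @ (w / norm_w)
    grad_w = (g / norm_w) * (grad_v - (grad_v @ w / norm_w**2) * w)
    g -= lr_g * grad_g
    w -= lr_w * grad_w

v_final = g * w / np.linalg.norm(w)
print("train residual:", np.linalg.norm(X @ v_final - y))
print("distance to min-norm solution:", np.linalg.norm(v_final - v_min_norm))
print("norm ratio:", np.linalg.norm(v_final) / np.linalg.norm(v_min_norm))
```

The printed diagnostics compare the solution reached from this far-from-zero initialization with the minimum l2 norm solution; running plain gradient descent on v from the same initialization gives a point of comparison for the sensitivity to initialization discussed above.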