In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories, giving rise to a surprising range of meaningful inductive biases: identifying sparse classifiers or reconstructing low-rank matrices without explicit regularization. This implicit regularization has been hypothesised to be a contributing factor to good generalization in deep learning. Natural gradient descent, however, is approximately invariant to reparametrization: it follows (approximately) the same trajectory and finds the same optimum regardless of how the model is parametrized. This raises a natural question: if we eliminate the role of parametrization, which solution is found, and what new properties emerge? We characterize the behaviour of natural gradient flow in deep linear networks for separable classification under logistic loss and for deep matrix factorization. Some of our findings extend to nonlinear neural networks with sufficient but finite over-parametrization. We demonstrate that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.
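The first claim, that the choice of parametrization alone can act as a regularizer, can be seen in a toy setting. Below is a minimal sketch, assuming a diagonal linear network w = u ⊙ v and a synthetic sparse regression problem (both illustrative choices, not taken from the paper): plain gradient descent from a small initialization on the product parametrization tends toward a small-ℓ1, sparse interpolant, whereas the minimum-ℓ2 interpolant it is compared against is dense.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100
w_star = np.zeros(d)
w_star[:5] = 1.0                     # sparse ground truth
X = rng.normal(size=(n, d))
y = X @ w_star                       # noiseless, underdetermined regression

# Diagonal linear "network": f(x) = x . (u * v), trained by plain gradient
# descent from a small initialization (no explicit regularizer anywhere).
u = np.full(d, 1e-4)
v = np.full(d, 1e-4)
lr = 0.05
for _ in range(50_000):
    r = X @ (u * v) - y              # residuals
    gu = (X * v).T @ r / n           # dL/du for L = ||r||^2 / (2n)
    gv = (X * u).T @ r / n           # dL/dv
    u -= lr * gu
    v -= lr * gv

w_gd = u * v
w_l2 = X.T @ np.linalg.solve(X @ X.T, y)   # minimum-l2-norm interpolant, for contrast

# The product parametrization typically biases GD toward a much smaller l1 norm
# (i.e. a sparser solution) than the dense minimum-l2 interpolant.
print("l1 norm, GD on (u, v):     ", np.abs(w_gd).sum())
print("l1 norm, min-l2 interpolant:", np.abs(w_l2).sum())
```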
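The invariance claim can likewise be checked numerically. Below is a minimal sketch, assuming a linear regression model and an elementwise cubic reparametrization w = θ³ (an illustrative bijection, not from the paper): the gradient step changes when mapped between parametrizations, while the natural gradient step, preconditioned by the Fisher matrix (here the Gauss-Newton matrix of the squared loss), induces the same first-order change in w either way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)            # current parameters in the direct parametrization
y = rng.normal(size=n)
r = X @ w - y                     # residuals of the linear model f(x) = w . x

# Reparametrize w = theta**3 (a smooth bijection), so the same function is
# represented by theta = cbrt(w).
theta = np.cbrt(w)
dw_dtheta = 3 * theta**2          # diagonal Jacobian of the reparametrization

# Gradients of L = ||X w - y||^2 / (2n) in each parametrization (chain rule).
g_w = X.T @ r / n
g_theta = dw_dtheta * g_w

# Plain gradient steps, both mapped to w-space to first order: they disagree.
gd_w = -g_w
gd_theta_in_w = -dw_dtheta * g_theta           # = -(dw/dtheta)^2 * g_w

# Natural gradient steps: precondition by the Fisher / Gauss-Newton matrix.
F_w = X.T @ X / n
J_theta = X * dw_dtheta                        # Jacobian of outputs wrt theta
F_theta = J_theta.T @ J_theta / n
ngd_w = -np.linalg.solve(F_w, g_w)
ngd_theta_in_w = -dw_dtheta * np.linalg.solve(F_theta, g_theta)

print(np.allclose(gd_w, gd_theta_in_w))   # False: GD depends on parametrization
print(np.allclose(ngd_w, ngd_theta_in_w)) # True (up to numerical error): NGD does not
```

The equality in the last line is an algebraic identity: with D = diag(dw/dθ), the θ-space natural gradient step is -D⁻¹ F_w⁻¹ g_w, which maps back to exactly the w-space natural gradient step.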