Recently, there has been significant progress in understanding the convergence and generalization properties of gradient-based methods for training overparameterized learning models. However, many aspects, including the role of small random initialization and how the various parameters of the model are coupled during gradient-based updates to facilitate good generalization, remain largely mysterious. A series of recent papers has begun to study these questions for non-convex formulations of symmetric positive semi-definite (PSD) matrix sensing problems, which involve reconstructing a low-rank PSD matrix from a few linear measurements. The underlying symmetry/PSDness is crucial to existing convergence and generalization guarantees for this problem. In this paper, we study a general overparameterized low-rank matrix sensing problem in which one wishes to reconstruct an asymmetric rectangular low-rank matrix from a few linear measurements. We prove that an overparameterized model trained via factorized gradient descent converges to the low-rank matrix generating the measurements. We show that in this setting, factorized gradient descent enjoys two implicit properties: (1) coupling of the trajectory of gradient descent, where the factors are coupled in various ways throughout the gradient-update trajectory, and (2) an algorithmic regularization property, where the iterates show a propensity towards low-rank models despite the overparameterized nature of the factorized model. These two implicit properties in turn allow us to show that the gradient descent trajectory from small random initialization moves towards solutions that are both globally optimal and generalize well.
Title: Implicit Balancing and Regularization: Generalization and Convergence Guarantees for Overparameterized Asymmetric Matrix Sensing
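The factorized gradient descent procedure the abstract describes can be sketched concretely. The following is a minimal NumPy illustration, not the paper's method: all problem sizes (`n1`, `n2`, the true rank `r`, the overparameterized factor rank `k > r`, the number of measurements `m`), the initialization scale `alpha`, and the step size `eta` are hypothetical choices made for this demo. It reconstructs an asymmetric rectangular low-rank matrix `Xstar` from random Gaussian linear measurements by running gradient descent on the factorized model `X = U @ V.T` from a small random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes: recover a rank-r matrix Xstar (n1 x n2) from m
# linear measurements y_i = <A_i, Xstar>, using an overparameterized rank-k
# factorization X = U V^T with k > r.
n1, n2, r, k, m = 15, 12, 2, 4, 150
Xstar = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
Xstar /= np.linalg.norm(Xstar, 2)  # unit spectral norm, for a stable step size

# Gaussian measurement matrices, scaled so the measurement map is
# approximately norm-preserving on low-rank matrices.
A = rng.standard_normal((m, n1, n2)) / np.sqrt(m)
y = np.einsum('mij,ij->m', A, Xstar)

# Small random initialization of both factors.
alpha = 1e-3
U = alpha * rng.standard_normal((n1, k))
V = alpha * rng.standard_normal((n2, k))

eta = 0.2  # step size
for _ in range(4000):
    resid = np.einsum('mij,ij->m', A, U @ V.T) - y   # <A_i, U V^T> - y_i
    G = np.einsum('m,mij->ij', resid, A)             # sum_i resid_i * A_i
    # Simultaneous gradient step on the loss 0.5 * ||A(U V^T) - y||^2.
    U, V = U - eta * G @ V, V - eta * G.T @ U

# Relative reconstruction error of the recovered matrix.
err = np.linalg.norm(U @ V.T - Xstar) / np.linalg.norm(Xstar)
# Gap between the two Gram matrices: the factors stay (approximately)
# balanced along the trajectory, one of the implicit coupling properties.
balance = np.linalg.norm(U.T @ U - V.T @ V)
```

Despite the factor rank `k` exceeding the true rank `r`, the iterates converge toward `Xstar` rather than an arbitrary rank-`k` solution fitting the measurements, which is the algorithmic regularization effect the abstract refers to.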