Stochastic differential equations (SDEs) have recently been shown to characterize well the dynamics of training machine learning models with SGD. This provides two opportunities for better understanding the generalization behaviour of SGD through its SDE approximation. First, under the SDE characterization, SGD may be regarded as full-batch gradient descent perturbed by Gaussian gradient noise. This allows the generalization bounds developed by Xu & Raginsky (2017) to be applied to SGD, yielding upper bounds in terms of the mutual information between the training set and the training trajectory. Second, under mild assumptions, it is possible to estimate the steady-state weight distribution of the SDE. Using this estimate, we apply the PAC-Bayes-like information-theoretic bounds developed in both Xu & Raginsky (2017) and Negrea et al. (2019) to obtain generalization upper bounds in terms of the KL divergence between the steady-state weight distribution of SGD and a prior distribution. Among various options, one may choose as the prior the steady-state weight distribution obtained by SGD on the same training set but with one example held out. In this case, the bound can be expressed elegantly using the influence function (Koh & Liang, 2017), suggesting that the generalization of SGD is related to its stability. Various insights are presented along the development of these bounds, which are subsequently validated numerically.
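For concreteness, a minimal sketch of the two ingredients referenced above; the notation here (learning rate $\eta$, gradient-noise covariance $\Sigma$, sub-Gaussian parameter $\sigma$) is illustrative and not fixed by the text. A standard SDE approximation of SGD is

$$ dW_t = -\nabla L_S(W_t)\,dt + \sqrt{\eta}\;\Sigma(W_t)^{1/2}\,dB_t, $$

where $L_S$ is the empirical risk on the training set $S$ and $B_t$ is standard Brownian motion, so the drift is the full-batch gradient and the diffusion term plays the role of the Gaussian gradient noise. Under this view, the bound of Xu & Raginsky (2017) applies: if the loss is $\sigma$-sub-Gaussian, the expected generalization gap of the weights $W$ produced from an $n$-example training set $S$ satisfies

$$ \bigl|\mathbb{E}[\operatorname{gen}(S, W)]\bigr| \le \sqrt{\frac{2\sigma^2}{n}\, I(S; W)}. $$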