Recent works have shown that high-probability metrics for stochastic gradient descent (SGD) are informative and, in some cases, advantageous compared with the commonly adopted metrics based on mean-square error. In this work, we provide a formal framework for studying general high-probability bounds for SGD, based on the theory of large deviations. The framework allows for generic (not necessarily bounded) gradient noise that satisfies mild technical assumptions and whose distribution may depend on the current iterate. Under these assumptions, we derive a large deviations upper bound for SGD with strongly convex functions. The corresponding rate function captures the analytical dependence on the noise distribution and other problem parameters. This contrasts with conventional mean-square error analysis, which captures the noise dependence only through the variance and captures neither the effect of higher-order moments nor the interplay between the noise geometry and the shape of the cost function. We also derive exact large deviations rates for the case when the objective function is quadratic and show that the resulting rate function matches the one from the general upper bound, thereby establishing the tightness of the general upper bound. Numerical examples illustrate and corroborate the theoretical findings.
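As a rough illustration of what a large deviations rate measures, the following sketch estimates how fast the tail probability of the SGD error decays with the iteration count. It is a toy setup and not the framework of this work: the scalar objective f(x) = x^2/2, the step size 1/k, the additive Gaussian gradient noise, and the function names `sgd_final_iterates` and `empirical_tail_rate` are all assumptions made for the example. In this special case the iterate after k steps equals the negative sample mean of the noise, so Cramér's theorem gives the limiting rate delta^2 / (2 sigma^2).

```python
import math
import random


def sgd_final_iterates(num_steps, num_trials, sigma=1.0, x0=5.0, seed=0):
    """Run SGD on the assumed toy objective f(x) = x**2 / 2.

    Uses step size 1/k and additive Gaussian gradient noise (both are
    assumptions of this sketch). Returns the final iterate of each trial.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(num_trials):
        x = x0
        for k in range(1, num_steps + 1):
            # The exact gradient of x**2 / 2 is x; add zero-mean noise.
            noisy_grad = x + rng.gauss(0.0, sigma)
            x -= noisy_grad / k
        finals.append(x)
    return finals


def empirical_tail_rate(num_steps, delta, num_trials=20000, sigma=1.0):
    """Estimate -(1/k) * log P(|x_k - x*| >= delta), where x* = 0."""
    finals = sgd_final_iterates(num_steps, num_trials, sigma)
    hits = sum(1 for x in finals if abs(x) >= delta)
    if hits == 0:
        return float("inf")  # event not observed with this many trials
    return -math.log(hits / num_trials) / num_steps


if __name__ == "__main__":
    delta, sigma = 0.5, 1.0
    for k in (10, 20, 40):
        print(k, empirical_tail_rate(k, delta, sigma=sigma))
    print("limiting Gaussian rate:", delta**2 / (2 * sigma**2))
```

The empirical values approach the limiting rate only slowly, because of sub-exponential prefactors at finite k. Swapping in a different noise distribution with the same variance would change the tail decay, which is the kind of dependence a mean-square error analysis does not capture.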