We extend the global convergence result of Chatterjee \cite{chatterjee2022convergence} by considering stochastic gradient descent (SGD) for non-convex objective functions. Under mild additional assumptions that can be realized by finitely wide neural networks, we prove that if SGD is initialized inside a local region where the \L{}ojasiewicz condition holds, then with positive probability the stochastic gradient iterates converge to a global minimum inside this region. A key step in our proof is to ensure that the entire SGD trajectory stays inside the local region with positive probability. To this end, we assume that the SGD noise scales with the objective function, a condition known as machine learning noise that is satisfied in many practical examples. Furthermore, we give a negative argument showing why the boundedness of the noise combined with Robbins-Monro type step sizes is not sufficient to keep this key step valid.
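For concreteness, a minimal sketch of the two assumptions referred to above is displayed below, assuming the global minimum value inside the region is zero (as for interpolating networks); the neighborhood $B$ and the symbols $\alpha$, $\sigma$, $\xi_t$, $g_t$, and $\mathcal{F}_t$ are illustrative placeholders rather than the precise constants and conditions of the paper.
% Illustrative (not verbatim) forms of the two local assumptions:
% a \L{}ojasiewicz-type inequality on a neighborhood B of the initialization,
% and objective-scaled ("machine learning") noise for the stochastic gradient g_t.
\begin{align*}
  \|\nabla f(x)\|^{2} &\ge \alpha\, f(x)
    && \text{for all } x \in B,\ \text{with some } \alpha > 0, \\
  \mathbb{E}\!\left[\|\xi_t\|^{2} \,\middle|\, \mathcal{F}_t\right]
    &\le \sigma^{2} f(x_t)
    && \text{for the noise } \xi_t = g_t - \nabla f(x_t).
\end{align*}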