Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in machine learning applications, step sizes often do not satisfy this condition, yet (stochastic) gradient descent still converges, albeit in an unstable manner. We investigate this unstable convergence phenomenon from first principles and discuss key causes behind it. We also identify its main characteristics and how they interrelate, based on both theory and experiments, offering a principled view toward understanding the phenomenon.
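To make the $2/L$ threshold concrete, here is a minimal illustrative sketch (not from the paper) of gradient descent on the $L$-smooth quadratic $f(x) = \frac{L}{2}x^2$, where the classical condition is exact: the iteration $x \leftarrow (1 - \eta L)x$ contracts for step sizes $\eta < 2/L$ and diverges once $\eta > 2/L$. Note that on such a quadratic the dynamics simply diverge above the threshold; the unstable convergence studied in this work arises on non-quadratic losses, where training can still make progress despite violating the condition.

```python
import numpy as np

# Illustrative sketch (assumed toy setup, not the paper's experiments):
# gradient descent on f(x) = (L/2) * x**2, whose gradient is L * x.
# The update x <- (1 - eta * L) * x converges for eta < 2/L and diverges otherwise.

def gd_on_quadratic(L=1.0, eta=0.5, steps=100, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= eta * L * x  # gradient step on the quadratic cost
    return x

# Step sizes below, near, and above the 2/L threshold (here L = 1, so 2/L = 2)
for eta in [0.5, 1.9, 2.1]:
    print(f"eta = {eta}: final |x| = {abs(gd_on_quadratic(eta=eta)):.3e}")
```

Running this prints values near zero for $\eta = 0.5$ and $\eta = 1.9$, but a large value for $\eta = 2.1$, matching the classical stability boundary that the paper's analysis starts from.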