In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. A fundamental aspect of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in the over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under the same interpolation-like conditions and show that its oracle complexity to reach an $\epsilon$-local-minimizer is $\tilde{\mathcal{O}}(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{\mathcal{O}}(1/\epsilon^{1.5})$ of the deterministic Cubic-Regularized Newton method; it appears that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order setting.
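As a concrete illustration, interpolation-like gradient conditions of this kind are commonly formalized in the literature through a strong-growth-type bound; a representative (hedged) version, where $g(x;\xi)$ denotes an unbiased stochastic gradient of the objective $F$ and $\rho \ge 0$ is a constant (the exact form and constants used in our analysis may differ), reads $\mathbb{E}_{\xi}\big\| g(x;\xi) - \nabla F(x) \big\|^{2} \le \rho\,\big\| \nabla F(x) \big\|^{2}$ for all $x$, so that the stochastic-gradient noise vanishes at stationary points of $F$, reflecting the fact that an interpolating over-parametrized model has zero gradient on every training sample at a global minimizer.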