The performance of optimization on quadratic problems depends sensitively on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power-law distributions. In this paper we perform a systematic study of a range of classical single-step and multi-step first-order optimization algorithms, with adaptive and non-adaptive, constant and non-constant learning rates: vanilla Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients. For each of these, we prove that a power-law spectral assumption entails a power law for the convergence rate of the algorithm, with the convergence rate exponent given by a specific multiple of the spectral exponent. We establish both upper and lower bounds, showing that the results are tight. Finally, we demonstrate applications of these results to kernel learning and to training of neural networks in the NTK regime.
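As a rough illustration of the claim above (not taken from the paper), the minimal sketch below runs vanilla Gradient Descent on a diagonal quadratic whose eigenvalues follow an assumed power law $\lambda_k = k^{-\nu}$, and reads off an empirical convergence exponent from a log-log fit of the loss curve. The spectral exponent `nu`, the dimension, the flat initialization, and the constant step size are all hypothetical choices; the exact relation between the spectral and convergence exponents is the subject of the paper's theorems and is not asserted here.

```python
import numpy as np

# Illustrative sketch (assumptions, not the paper's setup): plain gradient
# descent on a diagonal quadratic f(w) = 0.5 * sum_k lam_k * w_k**2 with a
# power-law spectrum lam_k = k**(-nu).  The loss is expected to decay as a
# power law in the iteration count t; the exponent is estimated by a fit.

nu = 1.5        # assumed spectral exponent (hypothetical choice)
dim = 50_000    # large dimension to mimic the effectively infinite case
T = 5_000       # number of gradient descent steps

k = np.arange(1, dim + 1)
lam = k ** (-nu)          # power-law eigenvalues lambda_k = k^{-nu}
w = np.ones(dim)          # error coefficients at initialization (assumed flat)
lr = 1.0 / lam[0]         # constant step size 1 / lambda_max

losses = np.empty(T)
for t in range(T):
    losses[t] = 0.5 * np.sum(lam * w ** 2)
    w -= lr * lam * w     # gradient step: grad f(w) = lam * w (elementwise)

# Fit log L(t) ~ -xi * log t over the tail of the run to estimate the
# empirical convergence-rate exponent xi.
ts = np.arange(1, T + 1)
tail = ts > 100
xi = -np.polyfit(np.log(ts[tail]), np.log(losses[tail]), 1)[0]
print(f"empirical convergence exponent ~ {xi:.2f} for spectral exponent nu = {nu}")
```

In this toy setting the fitted exponent simply demonstrates the power-law decay of the loss under a power-law spectrum; the multi-step methods named in the abstract (Steepest Descent, Heavy Ball, Conjugate Gradients) would require their own update rules and are not sketched here.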