Learned optimizers are algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance in certain settings, their inner workings remain a mystery. How is a learned optimizer able to outperform a well-tuned baseline? Has it learned a sophisticated combination of existing optimization techniques, or is it implementing completely new behavior? In this work, we address these questions through careful analysis and visualization of learned optimizers. We study learned optimizers trained from scratch on three disparate tasks, and discover that they have learned interpretable mechanisms, including momentum, gradient clipping, learning rate schedules, and a new form of learning rate adaptation. Moreover, we show how the dynamics of learned optimizers enable these behaviors. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
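To make the contrast between hand-designed update rules and learned parameterizations concrete, the following is a minimal sketch (not the paper's implementation): a classical momentum update next to a "learned" update produced by a tiny per-parameter MLP whose weights would be meta-trained. All names, feature choices, and shapes here are illustrative assumptions.

```python
# Hypothetical sketch: hand-designed momentum vs. a learned, MLP-based update rule.
# Not the paper's architecture; feature set and network size are assumptions.
import numpy as np

def momentum_update(param, grad, velocity, lr=0.1, beta=0.9):
    """Classical momentum: a simple update rule derived from theoretical principles."""
    velocity = beta * velocity + grad
    return param - lr * velocity, velocity

def learned_update(param, grad, velocity, W1, b1, W2, b2):
    """A learned optimizer: a small MLP maps per-parameter features
    (here just gradient and momentum) to an update. Its weights
    (W1, b1, W2, b2) would be meta-trained over many optimization tasks."""
    features = np.stack([grad, velocity], axis=-1)  # (n_params, 2)
    hidden = np.tanh(features @ W1 + b1)            # (n_params, 16)
    step = hidden @ W2 + b2                         # (n_params, 1)
    return param - step[..., 0]

# Toy usage on a quadratic loss f(x) = 0.5 * ||x||^2, so grad = x.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
v = np.zeros_like(x)
W1, b1 = rng.normal(size=(2, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

x_momentum, v = momentum_update(x, x, v)               # one hand-designed step
x_learned = learned_update(x, x, v, W1, b1, W2, b2)    # one learned step (untrained here)
print(x_momentum, x_learned)
```

The point of the sketch is only the structural difference: the momentum rule has two scalar hyperparameters, whereas the learned rule's behavior is determined by a high-dimensional, nonlinear parameterization, which is what makes its mechanisms hard to read off directly and motivates the analysis in this work.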