Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time. We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and hyperparameters (e.g. momentum coefficients). We can even recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs, CNNs, and RNNs. Finally, we provide a simple PyTorch implementation of this algorithm (see people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer).
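As a concrete illustration of the idea (a minimal sketch only, not the released implementation at the URL above), the toy PyTorch loop below learns the step size alpha of plain SGD by keeping the update w - alpha * grad inside the autograd graph, so backpropagation delivers the hypergradient with respect to alpha automatically. The quadratic toy problem, the hyper-step size kappa, and all variable names are illustrative assumptions, not part of the paper.

    import torch

    # Toy regression problem: fit y = w * x with plain SGD whose step size
    # alpha is itself updated by gradient descent (a hypergradient step).
    torch.manual_seed(0)
    x = torch.randn(100)
    y = 3.0 * x + 0.1 * torch.randn(100)

    w = torch.tensor(0.0, requires_grad=True)       # model parameter
    alpha = torch.tensor(0.01, requires_grad=True)  # step size, also learned
    kappa = 0.001                                   # step size for alpha (hyper-hyperparameter)

    for step in range(200):
        loss = ((w * x - y) ** 2).mean()

        # Gradient of the loss w.r.t. w, detached so that only alpha's role
        # in the update rule (not in the gradient itself) is tracked.
        g_w, = torch.autograd.grad(loss, w)
        g_w = g_w.detach()

        # Differentiable update: w_new depends on alpha, so the loss at the
        # updated weights carries a gradient signal back to alpha.
        w_new = w - alpha * g_w
        next_loss = ((w_new * x - y) ** 2).mean()

        # Hypergradient step on alpha, computed automatically by autograd.
        g_alpha, = torch.autograd.grad(next_loss, alpha)
        with torch.no_grad():
            alpha -= kappa * g_alpha

        # Commit the weight update and start a fresh graph next iteration.
        w = w_new.detach().requires_grad_(True)

    print(f"learned w = {w.item():.3f}, learned alpha = {alpha.item():.4f}")

In this sketch the hypergradient is obtained from an explicit lookahead loss for clarity; the same quantity can be accumulated lazily across iterations, and the construction can be repeated so that kappa is itself learned, which is the stacking the abstract refers to.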