Machine learning training methods depend plentifully and intricately on hyperparameters, motivating automated strategies for their optimisation. Many existing algorithms restart training for each new hyperparameter choice, at considerable computational cost. Some hypergradient-based one-pass methods exist, but these either cannot be applied to arbitrary optimiser hyperparameters (such as learning rates and momenta) or take several times longer to train than their base models. We extend these existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts. We also provide a motivating argument for convergence to the true hypergradient, and perform tractable gradient-based optimisation of independent learning rates for each model parameter. Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time only 2-3x greater than vanilla training.
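To make the idea of a one-pass, hypergradient-based update of per-parameter learning rates concrete, the sketch below differentiates a single SGD step through the chain rule, in the style of hypergradient descent: with the update w_t = w_{t-1} - lr ⊙ g_{t-1}, the elementwise derivative of the next loss with respect to each learning rate is -g_t ⊙ g_{t-1}. This is a simplified illustration under the assumption of a plain SGD inner optimiser, not the approximate hypergradient method developed in the paper; `loss_fn`, `hyper_lr` and the toy regression data are illustrative placeholders.

```python
# Minimal sketch (not the paper's algorithm) of a one-pass, hypergradient-style
# update of one learning rate per model parameter, assuming plain SGD inside.
# For w_t = w_{t-1} - lr * g_{t-1}, the elementwise hypergradient is
# dL_t/dlr = (dL_t/dw_t) * (dw_t/dlr) = -g_t * g_{t-1}.
import torch

def loss_fn(w, x, y):
    # Toy linear-regression loss; stands in for any differentiable model.
    return ((x @ w - y) ** 2).mean()

torch.manual_seed(0)
x, y = torch.randn(64, 10), torch.randn(64)
w = torch.randn(10, requires_grad=True)
lr = torch.full_like(w, 1e-2)      # independent learning rate per parameter
hyper_lr = 1e-3                    # step size for the hyperparameter update (assumed)

prev_grad = None
for step in range(100):
    loss = loss_fn(w, x, y)
    grad, = torch.autograd.grad(loss, w)
    if prev_grad is not None:
        # Elementwise hypergradient of the current loss w.r.t. each learning rate.
        hypergrad = -grad * prev_grad
        lr = (lr - hyper_lr * hypergrad).clamp(min=1e-6)
    with torch.no_grad():
        w -= lr * grad             # ordinary SGD step with the adapted rates
    prev_grad = grad.detach()
```

Because the learning rates are adapted online from gradients already computed for the weight update, the whole procedure fits inside a single training episode, which is the regime the abstract describes; the paper's method generalises this to arbitrary continuous hyperparameters of a differentiable weight update.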