Training complex machine learning (ML) architectures requires a compute- and time-consuming process of selecting the right optimizer and tuning its hyperparameters. A new paradigm of learning optimizers from data has emerged as a better alternative to hand-designed ML optimizers. We propose the Mnemosyne optimizer, which uses Performers: implicit low-rank attention Transformers. Mnemosyne can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) generalizes better than the popular LSTM optimizer, (b) in particular, can successfully train Vision Transformers (ViTs) while being meta-trained on standard MLPs, and (c) can initialize optimizers for faster convergence in robotics applications. We believe these results open the possibility of using Transformers to build foundational optimization models that can address the challenges of regular Transformer training. We complement our results with an extensive theoretical analysis of the compact associative memory used by Mnemosyne.
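To make "implicit low-rank attention" concrete, below is a minimal sketch (not the authors' implementation) of the Performer-style attention mechanism the abstract refers to: softmax attention is approximated with random feature maps so the full L x L attention matrix is never materialized, giving complexity linear in sequence length. The function name `performer_attention` and the parameter `num_features` are illustrative choices, not names from the paper.

```python
# Minimal sketch of Performer-style (FAVOR+) low-rank attention, assuming the
# positive random-feature approximation of the softmax kernel. Illustrative only.
import numpy as np

def performer_attention(Q, K, V, num_features=64, rng=None):
    """Approximate softmax attention in O(L * d * m) instead of O(L^2 * d).

    Q, K: (L, d) queries/keys; V: (L, d_v) values; num_features = m.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    L, d = Q.shape
    W = rng.standard_normal((num_features, d))  # random projection directions

    def phi(X):
        # Positive random features approximating exp(<q, k> / sqrt(d)).
        Xp = X / d ** 0.25
        proj = Xp @ W.T  # (L, m)
        return np.exp(proj - 0.5 * np.sum(Xp ** 2, axis=-1, keepdims=True)) / np.sqrt(num_features)

    Qp, Kp = phi(Q), phi(K)                            # (L, m) each
    KV = Kp.T @ V                                      # (m, d_v): cost linear in L
    normalizer = Qp @ Kp.sum(axis=0, keepdims=True).T  # (L, 1)
    return (Qp @ KV) / (normalizer + 1e-6)

# Usage: same interface shape-wise as standard attention, but time and memory
# scale linearly with sequence length L.
L, d = 1024, 32
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = performer_attention(Q, K, V)
print(out.shape)  # (1024, 32)
```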