The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted first on a non-convex benchmark function and then on a real-world machine-learning dataset. The initial comparative study against the Adam optimiser uses a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations are evaluated to eliminate learning-rate bias: (i) both optimisers using ArcGD's effective learning rate, and (ii) both using Adam's default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved superior final solutions in most cases. In the second evaluation, ArcGD is compared against state-of-the-art optimisers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image-classification dataset across 8 diverse MLP architectures with 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of the 8 architectures. Notably, Adam and AdamW showed strong early convergence at 5,000 iterations but regressed with extended training, whereas ArcGD continued improving, demonstrating good generalisation and resistance to overfitting without early-stopping tuning. Strong performance on both geometric stress tests and a standard deep-learning benchmark indicates broad applicability and motivates further exploration. Moreover, it is shown that a limiting variant of ArcGD can be interpreted as a sign-based, momentum-like update, highlighting conceptual connections between ArcGD's inherent mechanisms and the Lion optimiser.
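For readers unfamiliar with the two ingredients named above, the following minimal sketch illustrates (a) the classic deterministic Rosenbrock function (the paper evaluates a stochastic variant, which is not reproduced here) and (b) a generic Lion-style sign-based momentum update of the kind the limiting variant of ArcGD is related to. All function names and hyperparameter values are illustrative assumptions; this is not the paper's ArcGD update.

```python
import numpy as np

def rosenbrock(x):
    """Classic Rosenbrock function: narrow curved valley, global minimum at x = (1, ..., 1)."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rosenbrock_grad(x):
    """Analytic gradient of the Rosenbrock function."""
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return g

def sign_momentum_step(x, m, grad, lr=1e-3, beta1=0.9, beta2=0.99):
    """One Lion-style sign-based momentum step (illustrative sketch, NOT ArcGD).

    The parameter step uses only the sign of an interpolation between the
    momentum buffer and the current gradient, so its magnitude is bounded
    by lr per coordinate; the momentum buffer decays with a slower beta2.
    """
    x_new = x - lr * np.sign(beta1 * m + (1.0 - beta1) * grad)
    m_new = beta2 * m + (1.0 - beta2) * grad
    return x_new, m_new

# Minimal run: 2D Rosenbrock from the standard start point (-1.2, 1).
x = np.array([-1.2, 1.0])
m = np.zeros_like(x)
for _ in range(2000):
    x, m = sign_momentum_step(x, m, rosenbrock_grad(x))
final_loss = rosenbrock(x)
```

Because the step magnitude is fixed at `lr` per coordinate regardless of gradient scale, sign-based updates are robust to the extreme gradient magnitudes in the Rosenbrock valley but trade off fine-grained convergence near the optimum, which is why such methods oscillate around the valley floor rather than diverging.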


