Adam and AdaBelief compute and make use of elementwise adaptive stepsizes when training deep neural networks (DNNs) by tracking the exponential moving average (EMA) of the squared gradient g_t^2 and the squared prediction error (m_t-g_t)^2, respectively, where m_t is the first momentum at iteration t and can be viewed as a prediction of g_t. In this work, we investigate whether layerwise gradient statistics can be exploited in Adam and AdaBelief to allow for more effective training of DNNs. We address this research question in two steps. First, we slightly modify Adam and AdaBelief by introducing layerwise adaptive stepsizes into their update procedures via either pre- or post-processing. Our empirical results indicate that this slight modification yields comparable performance when training VGG and ResNet models on CIFAR10 and CIFAR100, suggesting that layerwise gradient statistics play an important role in the success of Adam and AdaBelief, at least for certain DNN tasks. In the second step, we propose Aida, a new optimisation method, designed so that the elementwise stepsizes within each layer have significantly smaller statistical variances and the layerwise average stepsizes are much more compact across all the layers. Motivated by the observation that (m_t-g_t)^2 in AdaBelief is conservative compared with g_t^2 in Adam in terms of layerwise statistical averages and variances, Aida tracks an even more conservative function of m_t and g_t than (m_t-g_t)^2, obtained via layerwise vector projections. Experimental results show that Aida performs competitively with, or better than, a number of existing methods including Adam and AdaBelief on a set of challenging DNN tasks.
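As a rough illustration (not the authors' reference implementation), the sketch below contrasts the second-moment EMAs that drive the elementwise stepsizes of Adam and AdaBelief, and shows one plausible reading of Aida's layerwise mutual vector projections based solely on the description above; the function names, the projection count K, and the eps constant are assumptions for this sketch.

import numpy as np

def ema_second_moments(m, g, v_adam, s_belief, beta2=0.999):
    # m: first momentum (prediction of g_t); g: current gradient,
    # both 1-D arrays holding the parameters of one layer.
    # Adam tracks an EMA of the squared gradient g_t^2.
    v_adam = beta2 * v_adam + (1.0 - beta2) * g**2
    # AdaBelief tracks an EMA of the squared prediction error (m_t - g_t)^2.
    s_belief = beta2 * s_belief + (1.0 - beta2) * (m - g)**2
    return v_adam, s_belief

def aida_projected_residual(m, g, K=2, eps=1e-20):
    # Hypothetical sketch of the layerwise vector projections: mutually
    # project m and g onto each other K times, which shrinks the residual
    # (m - g), giving a more conservative quantity than (m_t - g_t)^2
    # for the EMA to track. K and eps are illustrative, not from the text.
    for _ in range(K):
        m_on_g = (np.dot(m, g) / (np.dot(g, g) + eps)) * g
        g_on_m = (np.dot(g, m) / (np.dot(m, m) + eps)) * m
        m, g = m_on_g, g_on_m
    return (m - g)**2

Because each projection keeps only the component of one vector along the other, the projected residual is never larger than the raw residual for that layer, which is consistent with the stated goal of smaller within-layer stepsize variances.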