The best performing Binary Neural Networks (BNNs) are usually attained using Adam optimization and its multi-step training variants. However, to the best of our knowledge, few studies explore the fundamental reasons why Adam is superior to other optimizers like SGD for BNN optimization or provide analytical explanations that support specific training strategies. To address this, in this paper we first investigate the trajectories of gradients and weights in BNNs during the training process. We show the regularization effect of second-order momentum in Adam is crucial to revitalize the weights that are dead due to the activation saturation in BNNs. We find that Adam, through its adaptive learning rate strategy, is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability. Furthermore, we inspect the intriguing role of the real-valued weights in binary networks, and reveal the effect of weight decay on the stability and sluggishness of BNN optimization. Through extensive experiments and analysis, we derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset using the same architecture as the state-of-the-art ReActNet, a 1.1% higher accuracy. Code and models are available at https://github.com/liuzechun/AdamBNN.
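For reference, the claim about second-order momentum can be read off the standard Adam update rule (standard Kingma–Ba notation; the hyper-parameters $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$ below are generic and not values prescribed by this work):
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\]
with $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$. Because the step is normalized by $\sqrt{\hat{v}_t}$, a weight whose gradient $g_t$ stays persistently small due to activation saturation still receives updates of a magnitude comparable to other weights (roughly bounded by $\alpha$), whereas under plain SGD its step is directly proportional to the small gradient and the weight effectively stops moving, which is the "dead weight" behavior referred to above.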