A multiplicative constant scaling factor is often applied to the model output to adjust the dynamics of neural network parameters. Such scaling has served as a key intervention in empirical studies of lazy and active training behavior. However, we show that the combination of this scaling and a commonly used adaptive learning rate optimizer strongly affects the training behavior of the neural network. This is problematic because it can induce \emph{unintended behavior} in neural networks, leading to misinterpretation of experimental results. Specifically, for some scaling settings, the effect of the adaptive learning rate disappears or is itself strongly influenced by the scaling factor. To avoid this unintended effect, we present a modification of the optimization algorithm and demonstrate remarkable differences between adaptive learning rate optimization and simple gradient descent, especially with a small ($<1.0$) scaling factor.
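The setup described above can be made concrete with a minimal sketch (not the paper's code): a network whose output is multiplied by a constant scaling factor, trained once with an adaptive learning rate optimizer (Adam) and once with plain SGD. The factor values, the toy data, and the architecture below are illustrative assumptions.

```python
# Minimal sketch of output scaling under Adam vs. SGD (illustrative only).
import torch
import torch.nn as nn

def train(alpha: float, optimizer_name: str, steps: int = 200) -> float:
    torch.manual_seed(0)
    x = torch.randn(64, 10)   # toy inputs (assumption)
    y = torch.randn(64, 1)    # toy targets (assumption)
    model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
    if optimizer_name == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    else:
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        # The multiplicative constant scaling factor applied to the model output.
        loss = loss_fn(alpha * model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# With alpha < 1.0 the gradients shrink roughly in proportion to alpha, which
# plain SGD feels directly, while Adam's per-parameter normalization largely
# cancels a constant rescaling of the gradient (up to its epsilon term).
for alpha in (0.1, 1.0, 10.0):
    print(alpha, train(alpha, "adam"), train(alpha, "sgd"))
```

This (approximate) invariance of Adam to a constant gradient rescaling is one way the scaling factor and the adaptive learning rate can interact: the intervention that the scaling factor is meant to implement may be partially absorbed by the optimizer rather than reflected in the training dynamics.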