In typical neural network training, the gradients in the backward pass are determined by the forward pass, so the two stages are coupled. However, neural networks are often observed to perform worse when gradients explode or vanish. To address this, numerous approaches such as Gradient Clipping (GC) and Adaptive Gradient Clipping (AGC) have been developed to improve the gradient behaviour of networks without normalization layers during the backward pass. These techniques decouple the backward pass from the forward pass and modify the gradients adaptively. A possible drawback of clipping approaches is that the clipping must be computed for every weight tensor in every layer. We propose the PowerGrad Transform (PGT), a comparable approach that also alters and improves the gradient flow in the backward pass but is computed only at the final softmax layer. It is computationally very efficient and outperforms both GC and AGC, resulting in improved performance in networks without batch normalization. PGT is easy to integrate into existing networks, requiring just a few lines of code, and significantly increases performance in non-BN ResNets. The impact is more pronounced on large datasets such as ImageNet, where networks do not fit all of the training data and some training headroom remains. PGT allows the network to fit the training data better while simultaneously improving its performance on the test set.
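To make the cost contrast concrete, the following is a minimal sketch and not the authors' implementation. `clip_by_norm` shows ordinary norm-based gradient clipping, which needs one norm computation per weight tensor. `power_transformed_logit_grad` shows what a transform confined to the softmax layer could look like; the abstract does not state the PGT formula, so the rule used here (raise the predicted probabilities to a power `alpha` and renormalize before forming the usual softmax cross-entropy gradient) is an assumption inferred from the method's name, and all function and parameter names are hypothetical.

```python
import math


def clip_by_norm(grads, max_norm):
    """Norm-based clipping: rescale each gradient tensor whose L2 norm exceeds max_norm."""
    clipped = []
    for g in grads:  # one norm computation per weight tensor
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    return clipped


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def power_transformed_logit_grad(logits, target, alpha):
    """Gradient w.r.t. the logits with power-transformed probabilities.

    ASSUMPTION: the probabilities p are replaced by p**alpha / sum(p**alpha)
    in the standard softmax cross-entropy gradient (p - y). This rule is
    illustrative only; the abstract does not specify it.
    """
    p = softmax(logits)
    powered = [pi ** alpha for pi in p]
    total = sum(powered)
    q = [v / total for v in powered]
    # Standard form of the cross-entropy gradient, with q in place of p.
    return [qi - (1.0 if i == target else 0.0) for i, qi in enumerate(q)]
```

Under this assumed rule, `alpha = 1` recovers the standard gradient, and the extra work is a single pass over one vector of class probabilities per example, independent of network depth, whereas the clipping loop touches every weight tensor.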