Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve generalization performance of neural networks. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this paper, we propose a new method called Batch Normalization Preconditioning (BNP). Instead of applying normalization explicitly through a batch normalization layer as is done in BN, BNP applies normalization by conditioning the parameter gradients directly during training. This is designed to improve the Hessian matrix of the loss function and hence convergence during training. One benefit is that BNP is not constrained by the mini-batch size and works in the online learning setting. Furthermore, its connection to BN provides theoretical insights on how BN improves training and how BN is applied to special architectures such as convolutional neural networks. As a theoretical foundation, we also present a novel convergence theory based on the Hessian condition number for a locally convex but not strongly convex loss, which is applicable to networks with a scale-invariant property.
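To make the gradient-preconditioning idea concrete, the following is a minimal illustrative sketch, not the paper's exact BNP preconditioner (which is derived in the body of the paper): for a single linear layer, the weight gradient is rescaled using the mini-batch variance of the layer's inputs, the same statistics an explicit BN layer would compute. The diagonal form of the preconditioner and all names below are assumptions made for illustration only.

```python
import numpy as np

# Illustrative sketch of preconditioning weight gradients with mini-batch
# input statistics, in place of inserting an explicit BN layer.
# NOTE: this is NOT the exact BNP preconditioner from the paper; it only
# shows the general idea of reshaping the effective curvature seen by
# gradient descent using the same statistics BN would use.

rng = np.random.default_rng(0)

# Toy regression data and a single linear layer: pred = X @ W.
n, d_in, d_out = 64, 8, 4
X = rng.normal(loc=3.0, scale=2.0, size=(n, d_in))   # poorly scaled inputs
W = rng.normal(size=(d_in, d_out)) * 0.1
Y = rng.normal(size=(n, d_out))

lr = 0.1
eps = 1e-5

for step in range(100):
    # Forward pass and mean-squared-error loss gradient.
    pred = X @ W
    grad_out = (pred - Y) / n          # dLoss / dpred
    grad_W = X.T @ grad_out            # ordinary gradient w.r.t. the weights

    # Mini-batch statistics of the layer input (what a BN layer would compute).
    var = X.var(axis=0)

    # Hypothetical diagonal preconditioner built from those statistics:
    # dividing each input direction's gradient component by its variance
    # mimics the rescaling an explicit BN layer would apply to the inputs.
    precond = 1.0 / (var + eps)
    W -= lr * (precond[:, None] * grad_W)
```

Because the preconditioner only modifies the update rule, the same loop applies with a mini-batch of size one (online learning), which is the setting where an explicit BN layer breaks down.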