In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has drawbacks. Foremost is arguably memory consumption, as computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization it would be possible to process them one by one while accumulating the weight gradients. Another drawback is that the distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, complicating implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, minimize the negative log likelihood of a Gaussian distribution that is then used to normalize that activation. Among other benefits, this will hopefully contribute to the democratization of AI research by lowering the hardware requirements for training larger models.
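To make the idea concrete, the following is a minimal sketch in PyTorch of what such a normalization layer could look like. The module name NLLNorm, the per-feature parameterization, the detaching of the parameters in the normalization step, and the weighting of the auxiliary term are all illustrative assumptions, not details stated above: the abstract only specifies that a Gaussian negative log-likelihood term is added to the loss and that the same Gaussian is used to normalize the activation.

```python
import torch
import torch.nn as nn

class NLLNorm(nn.Module):
    """Illustrative normalization layer with learnable Gaussian parameters.

    Instead of batch statistics, mu and log_sigma are ordinary parameters.
    They are fitted to the activation distribution by an auxiliary Gaussian
    negative log-likelihood term that the caller adds to the training loss,
    so no simultaneous processing of the whole batch is required.
    """

    def __init__(self, num_features):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_features))
        self.log_sigma = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        sigma = self.log_sigma.exp()
        # Gaussian NLL of the incoming activations under (mu, sigma),
        # averaged over batch and features (constant term omitted).
        nll = (self.log_sigma + 0.5 * ((x - self.mu) / sigma) ** 2).mean()
        # Normalize with the same parameters. Detaching them here is one
        # possible design choice (an assumption, not from the abstract):
        # it lets the NLL term alone pull mu and sigma toward the
        # activation statistics.
        x_hat = (x - self.mu.detach()) / sigma.detach()
        return x_hat, nll
```

In use, the returned `nll` terms from all such layers would be summed, scaled by some weighting factor, and added to the task loss, so that the distribution parameters are trained by ordinary gradient descent like every other parameter.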