A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made in understanding why BN boosts DN learning and inference performance; prior work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"). {\em We demonstrate that BN is an unsupervised learning technique that -- independent of the DN's weights or gradient-based learning -- adapts the geometry of a DN's spline partition to match the data.} BN provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries and hence the decision boundary for classification problems. This per-mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary.
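To make the operation the abstract refers to concrete, the following is a minimal NumPy sketch (an illustration, not code from the paper): each unit's pre-activations are centered and normalized with per-mini-batch statistics, so even a randomly initialized layer has its partition boundaries shifted toward the data, and the statistics -- and hence the boundaries -- fluctuate from one mini-batch to the next. The data, layer sizes, and variable names below are hypothetical.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(h, eps=1e-5):
    """Center and normalize pre-activations h (shape: batch x units)
    using the statistics of the current mini-batch."""
    mu = h.mean(axis=0)       # per-unit mini-batch mean
    sigma = h.std(axis=0)     # per-unit mini-batch standard deviation
    return (h - mu) / (sigma + eps), mu, sigma

# A randomly initialized affine layer: BN re-centers each unit's
# hyperplane h_k(x) = <w_k, x> + b_k so it passes through the
# mini-batch mean, independently of the (random) weights.
X = rng.normal(loc=3.0, scale=2.0, size=(128, 10))  # off-center data
W, b = rng.normal(size=(10, 4)), rng.normal(size=4)

# Two different mini-batches yield slightly different statistics,
# which jitters the partition (and decision) boundaries between steps.
for batch in (X[:64], X[64:]):
    h = batch @ W + b
    h_bn, mu, sigma = batch_norm(h)
    print("mini-batch mean of pre-activations:", np.round(mu, 2))
\end{verbatim}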