This paper highlights a subtle property of batch normalization (BN): successive batch normalizations interleaved with random linear transformations make hidden representations increasingly orthogonal across the layers of a deep neural network. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, under a mild assumption, we prove that the deviation of the representations from orthogonality decays rapidly with depth, up to a term inversely proportional to the network width. This result has two main implications: 1) Theoretically, as the depth grows, the distribution of the representations -- after the linear layers -- contracts to a Wasserstein-2 ball around an isotropic Gaussian distribution. Furthermore, the radius of this Wasserstein ball shrinks with the width of the network. 2) In practice, the orthogonality of the representations directly influences the performance of stochastic gradient descent (SGD). When representations are initially aligned, we observe that SGD wastes many iterations orthogonalizing the representations before classification. Nevertheless, we show experimentally that starting optimization from orthogonal representations is sufficient to accelerate SGD, with no need for BN.
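The orthogonalization effect described above can be illustrated with a minimal NumPy sketch: propagate a batch of nearly identical (highly aligned) inputs through a deep chain of random linear maps followed by batch normalization, and track how far the sample Gram matrix is from a scaled identity. The `orthogonality_gap` proxy, the width `d`, batch size `n`, and depth `L` below are illustrative assumptions, not the paper's exact metric or experimental setup.

```python
import numpy as np

def batch_norm(H):
    """Normalize each feature (row of H) to zero mean and unit variance across the batch."""
    H = H - H.mean(axis=1, keepdims=True)
    return H / (H.std(axis=1, keepdims=True) + 1e-8)

def orthogonality_gap(H):
    """Deviation of the (normalized) sample Gram matrix from a scaled identity.
    An illustrative proxy for the deviation from orthogonality; H has shape (width, batch)."""
    G = H.T @ H                    # batch x batch Gram matrix of the representations
    G = G / np.linalg.norm(G)      # scale-invariant normalization
    n = G.shape[0]
    return np.linalg.norm(G - np.eye(n) / np.sqrt(n))

# Hypothetical width, batch size, and depth chosen for illustration.
d, n, L = 512, 8, 50
rng = np.random.default_rng(0)

# Start from a highly aligned batch: all samples are small perturbations of one vector.
base = rng.standard_normal((d, 1))
H = base + 0.01 * rng.standard_normal((d, n))

for layer in range(1, L + 1):
    W = rng.standard_normal((d, d)) / np.sqrt(d)   # random linear transformation
    H = batch_norm(W @ H)                          # BN applied after the linear layer
    if layer % 10 == 0:
        print(f"depth {layer:3d}: orthogonality gap = {orthogonality_gap(H):.4f}")
```

Under these assumptions, the printed gap shrinks as the depth grows, consistent with the claim that the deviation from orthogonality decays with depth up to a width-dependent term.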