Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description becomes possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new way to diagnose (both theoretically and empirically) this criticality. To that end, we introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0<l$. These quantities are particularly useful when the network architecture involves many different layers. We discuss various properties of the partial Jacobians, such as their scaling with depth and their relation to the neural tangent kernel (NTK). We derive recurrence relations for the partial Jacobians and utilize them to analyze the criticality of deep MLP networks with (and without) LayerNorm. We find that the normalization layer changes the optimal values of hyperparameters and critical exponents. We argue that LayerNorm is more stable when applied to preactivations, rather than activations, due to a larger correlation depth.
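As an illustration of the central object, the following is a minimal sketch (not the authors' code) of how a partial Jacobian could be measured empirically for a tanh MLP. The initialization scheme, the choice of activation, and the normalization of the Jacobian norm by the width of layer $l$ are assumptions made for this example.

```python
# Hedged sketch: empirical partial Jacobian of an MLP, i.e. the derivative of
# preactivations in layer l with respect to preactivations in layer l0 < l.
# Initialization variances and the width normalization below are illustrative.
import jax
import jax.numpy as jnp

def init_mlp(key, widths, sigma_w=1.4, sigma_b=0.1):
    """Initialize MLP weights with variance sigma_w^2 / fan_in and biases with sigma_b^2."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, wk, bk = jax.random.split(key, 3)
        W = sigma_w / jnp.sqrt(n_in) * jax.random.normal(wk, (n_out, n_in))
        b = sigma_b * jax.random.normal(bk, (n_out,))
        params.append((W, b))
    return params

def preactivations(params, x, l0, l):
    """Return h^{l0} and a map from h^{l0} to h^l (preactivation to preactivation)."""
    h = x
    for i, (W, b) in enumerate(params[:l0]):
        h = W @ jnp.tanh(h) + b if i > 0 else W @ h + b  # first layer acts on the input directly

    def forward_from_l0(h_l0):
        h_cur = h_l0
        for W, b in params[l0:l]:
            h_cur = W @ jnp.tanh(h_cur) + b
        return h_cur

    return h, forward_from_l0

def partial_jacobian_norm(params, x, l0, l):
    """Squared Frobenius norm of dh^l / dh^{l0}, divided by the width of layer l (assumed convention)."""
    h_l0, f = preactivations(params, x, l0, l)
    J = jax.jacobian(f)(h_l0)        # shape (N_l, N_{l0})
    return jnp.sum(J ** 2) / J.shape[0]

key = jax.random.PRNGKey(0)
widths = [64] + [256] * 10           # input dimension followed by 10 hidden layers
params = init_mlp(key, widths)
x = jax.random.normal(jax.random.PRNGKey(1), (widths[0],))
print(partial_jacobian_norm(params, x, l0=2, l=8))
```

In practice one would average this quantity over inputs and initializations and track its growth or decay with $l - l_0$ to diagnose whether the chosen $(\sigma_w, \sigma_b)$ sit at criticality.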