Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description becomes possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new, practical way to diagnose criticality. We introduce \emph{partial Jacobians} of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and use these relations to analyze the criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select an optimal initialization for a broad class of deep neural networks. Using these tools we show quantitatively that proper stacking of LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze the MLP-Mixer architecture and show that it is everywhere critical.
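To make the verbal definition concrete, one natural way to write it (the notation here is illustrative and assumed, not necessarily the paper's: $h^l_i(x)$ denotes the $i$-th preactivation in layer $l$ on input $x$, $N_l$ the width of layer $l$, and the expectation runs over the initialization of the parameters $\theta$) is
\[
J^{\,l_0,l}_{ij}(x) \;=\; \frac{\partial h^{l}_i(x)}{\partial h^{l_0}_j(x)},
\qquad
\mathcal{J}(l_0,l) \;=\; \frac{1}{N_l}\,\mathbb{E}_{\theta}\!\left[\operatorname{Tr}\!\left(J^{\,l_0,l}\,\bigl(J^{\,l_0,l}\bigr)^{\!\top}\right)\right],
\]
with criticality diagnosed, under these assumptions, by whether the averaged norm $\mathcal{J}(l_0,l)$ stays of order one rather than growing or decaying exponentially as the depth difference $l-l_0$ increases.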