The ability to train randomly initialised deep neural networks is known to depend strongly on the variance of the weight matrices and biases as well as on the choice of nonlinear activation. Here we complement the existing geometric analysis of this phenomenon with an information-theoretic alternative. Lower bounds are derived for the mutual information between an input and the outputs of hidden layers. Using a mean field analysis, we provide analytic lower bounds as functions of the network weight and bias variances as well as the choice of nonlinear activation. These results show that initialisations known to be optimal from a training point of view are also superior from a mutual information perspective.
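As a rough illustration of the kind of mean field quantities the abstract refers to, the sketch below iterates the standard variance recursion for a deep tanh network, $q^{l+1} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\phi(\sqrt{q^l}\,z)^2] + \sigma_b^2$, to its fixed point for several weight variances. This is not the paper's derivation of the mutual information bounds; the function names, the chosen variance values, and the Monte Carlo estimator are illustrative assumptions.

```python
import numpy as np

def variance_map(q, sigma_w2, sigma_b2, phi=np.tanh, n_mc=100_000, seed=0):
    """One step of the mean field variance recursion
    q_{l+1} = sigma_w^2 * E[phi(sqrt(q_l) * z)^2] + sigma_b^2,  z ~ N(0, 1),
    estimated by Monte Carlo over the standard Gaussian."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_mc)
    return sigma_w2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b2

# Iterate the recursion deep into the network for a few weight variances
# (sigma_b^2 fixed at an arbitrary illustrative value).
for sigma_w2 in (0.5, 1.0, 2.0, 4.0):
    q = 1.0  # variance of the (normalised) input layer pre-activations
    for _ in range(50):
        q = variance_map(q, sigma_w2=sigma_w2, sigma_b2=0.05)
    print(f"sigma_w^2 = {sigma_w2:>4}: fixed-point q* ~ {q:.4f}")
```

The fixed point of this recursion is the same mean field object that makes analytic, initialisation-dependent bounds tractable: once the pre-activation variance (and, analogously, the correlation between two inputs) is known layer by layer, quantities such as mutual information lower bounds can be written as functions of $\sigma_w^2$, $\sigma_b^2$, and the activation $\phi$.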