The activation function deployed in a deep neural network has a great influence on the performance of the network at initialisation, which in turn has implications for training. In this paper we study how to avoid two problems at initialisation identified in prior works: rapid convergence of pairwise input correlations, and vanishing and exploding gradients. We prove that both of these problems can be avoided by choosing an activation function with a sufficiently large linear region around the origin, relative to the bias variance $\sigma_b^2$ of the network's random initialisation. We demonstrate empirically that using such activation functions leads to tangible benefits in practice, both in terms of test and training accuracy and in terms of training time. Furthermore, we observe that the shape of the nonlinear activation outside the linear region appears to have a relatively limited impact on training. Finally, our results also allow us to train networks in a new hyperparameter regime, with a much larger bias variance than has previously been possible.
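As a concrete illustration of the kind of activation meant here (a sketch of our own, not a specific construction from our results), one may take the piecewise map
$$
\phi_a(x) \;=\;
\begin{cases}
x, & |x| \le a,\\[2pt]
\operatorname{sign}(x)\,\bigl(a + \tanh(|x| - a)\bigr), & |x| > a,
\end{cases}
$$
which is the identity on the linear region $[-a, a]$ and saturates smoothly outside it; the outer branch has value $\pm a$ and slope $1$ at $\pm a$, so $\phi_a$ is continuously differentiable. The half-width $a$ is then chosen sufficiently large relative to $\sigma_b$. The particular saturating branch is only illustrative, in line with our observation that the shape of the activation outside the linear region has limited impact on training.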