The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
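To make the two schemes under comparison concrete, the following sketch (an illustration only, not the paper's experimental setup; the function names `gaussian_init`, `orthogonal_init`, and the chosen depth and width are assumptions) initializes a deep linear network once with iid Gaussian weights scaled to variance 1/width and once with Haar-orthogonal weights obtained via QR decomposition, then inspects the singular values of the end-to-end product matrix. Under orthogonal initialization the product is exactly an isometry at any depth, whereas under Gaussian initialization the spread of the spectrum grows with depth, which is the intuition behind the depth-independent versus depth-linear width requirements described above.

```python
import numpy as np


def gaussian_init(depth, width, rng):
    """iid Gaussian weights with variance 1/width (standard scaling for linear layers)."""
    return [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
            for _ in range(depth)]


def orthogonal_init(depth, width, rng):
    """Orthogonal weights: QR decomposition of a Gaussian matrix, sign-corrected
    so the result is uniform (Haar) over the orthogonal group."""
    layers = []
    for _ in range(depth):
        a = rng.normal(size=(width, width))
        q, r = np.linalg.qr(a)
        q *= np.sign(np.diag(r))  # flip column signs to remove the QR sign ambiguity
        layers.append(q)
    return layers


def end_to_end(layers):
    """End-to-end map W_L ... W_1 computed by the deep linear network."""
    prod = np.eye(layers[0].shape[1])
    for w in layers:
        prod = w @ prod
    return prod


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth, width = 100, 128  # illustrative values
    for name, init in [("gaussian", gaussian_init), ("orthogonal", orthogonal_init)]:
        svals = np.linalg.svd(end_to_end(init(depth, width, rng)), compute_uv=False)
        print(f"{name:10s}  max singular value = {svals.max():.3e}"
              f"  min singular value = {svals.min():.3e}")
```

Running the script prints all singular values equal to 1 for the orthogonal case, while the Gaussian case shows a widening gap between the largest and smallest singular values as the depth increases, a simple numerical picture of dynamical isometry.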