Existing analyses of optimization in deep learning are either continuous, focusing on (variants of) gradient flow, or discrete, directly treating (variants of) gradient descent. Gradient flow is amenable to theoretical analysis, but is stylized and disregards computational efficiency. The extent to which it represents gradient descent is an open question in deep learning theory. The current paper studies this question. Viewing gradient descent as an approximate numerical solution to the initial value problem of gradient flow, we find that the degree of approximation depends on the curvature along the latter's trajectory. We then show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature, suggesting they are well approximated by gradient descent. This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to a global minimum almost surely under random initialization. Experiments suggest that over simple deep neural networks, gradient descent with a conventional step size is indeed close to the continuous limit. We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
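The central viewpoint above, gradient descent as a coarse numerical solution to the gradient flow initial value problem, can be illustrated with a minimal sketch (not code from the paper; the quadratic loss, step size, and refinement factor are illustrative assumptions). It compares gradient descent with step size eta against a much finer Euler integration of the flow, so the gap between the two trajectories can be inspected directly:

```python
import numpy as np

# Minimal sketch: gradient descent as a coarse Euler discretization of
# the gradient flow initial value problem
#   d theta / dt = -grad L(theta),  theta(0) = theta_0.
# Illustrative quadratic loss L(x) = 0.5 x^T A x, so grad L(x) = A x;
# the eigenvalues of A play the role of curvature along the trajectory.

A = np.diag([1.0, 10.0])           # curvature along each coordinate
grad = lambda x: A @ x             # gradient of the quadratic loss

theta0 = np.array([1.0, 1.0])
eta, T = 0.01, 200                 # GD step size and number of steps

# Gradient descent: one Euler step of size eta per iteration.
gd = theta0.copy()
for _ in range(T):
    gd = gd - eta * grad(gd)

# Approximate gradient flow: Euler integration with a 100x finer step,
# run for the same total time T * eta.
fine = 100
flow = theta0.copy()
for _ in range(T * fine):
    flow = flow - (eta / fine) * grad(flow)

# The gap between the two endpoints reflects the curvature encountered
# along the flow's trajectory (here, the eigenvalues of A).
print("GD endpoint:  ", gd)
print("flow endpoint:", flow)
print("trajectory gap:", np.linalg.norm(gd - flow))
```

For this mildly curved quadratic the gap is small, matching the abstract's claim that favorable curvature makes gradient descent track the continuous limit closely; increasing the eigenvalues of A (or eta) widens the gap.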