The training of neural networks is a complex, high-dimensional, non-convex and noisy optimization problem whose theoretical understanding is of interest both from an applied perspective and for fundamental reasons. A core challenge is to understand the geometry and topography of the landscape that guides the optimization. In this work, we employ standard statistical-mechanics methods, namely phase-space exploration using Langevin dynamics, to study this landscape for an over-parameterized fully connected network performing a classification task on random data. Analyzing the fluctuation statistics, in analogy with thermal dynamics at constant temperature, we infer a clear geometric description of the low-loss region. We find that it is a low-dimensional manifold whose dimension can be readily obtained from the fluctuations. Furthermore, this dimension is controlled by the number of data points that reside near the classification decision boundary. Importantly, we find that a quadratic approximation of the loss near the minimum is fundamentally inadequate, due to the exponential nature of the decision boundary and the flatness of the low-loss region. As a result, the dynamics sample regions of higher curvature at higher temperatures, while producing quadratic-like statistics at any given temperature. We explain this behavior with a simplified loss model that is analytically tractable and reproduces the observed fluctuation statistics.
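The following is a minimal sketch, not the authors' code, of the kind of constant-temperature Langevin exploration described above. The network size, random data, step size and temperature are illustrative assumptions, and the dimension estimate uses the quadratic-basin (equipartition) relation, which the work argues breaks down across temperatures but which illustrates how a dimension can be read off the fluctuation statistics at a fixed temperature.

```python
# Sketch (assumed setup): Langevin exploration of the loss landscape of an
# over-parameterized fully connected classifier on random data.
# Update rule (discretized Langevin dynamics at temperature T):
#   w <- w - lr * grad L(w) + sqrt(2 * lr * T) * xi,   xi ~ N(0, I)
# At stationarity in a quadratic basin, equipartition gives
#   <L> - L_min = (d_eff / 2) * T,
# so the mean loss excess at fixed T yields an effective-dimension estimate.

import torch

torch.manual_seed(0)

# Random binary classification data (illustrative sizes, not the paper's).
N, D, H = 64, 20, 256                      # samples, input dim, hidden width
X = torch.randn(N, D)
y = torch.randint(0, 2, (N,)).float()

model = torch.nn.Sequential(
    torch.nn.Linear(D, H), torch.nn.Tanh(), torch.nn.Linear(H, 1)
)
loss_fn = torch.nn.BCEWithLogitsLoss()

def loss():
    return loss_fn(model(X).squeeze(-1), y)

def langevin_step(lr, T):
    """One Euler-Maruyama Langevin step at temperature T; returns the loss."""
    l = loss()
    model.zero_grad()
    l.backward()
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn_like(p) * (2 * lr * T) ** 0.5
            p.add_(-lr * p.grad + noise)
    return l.item()

# Relax into the low-loss region first (T = 0), then sample at fixed T.
for _ in range(2000):
    langevin_step(lr=1e-2, T=0.0)

T = 1e-4
samples = [langevin_step(lr=1e-3, T=T) for _ in range(5000)]

# For an over-parameterized (interpolating) network the minimal loss is near
# zero, so <L> itself approximates the excess above the minimum.
mean_excess = sum(samples[1000:]) / len(samples[1000:])   # discard burn-in
d_eff = 2 * mean_excess / T                               # equipartition estimate
print(f"<L> at T={T:g}: {mean_excess:.4g}, effective dimension ~ {d_eff:.1f}")
```

Repeating the constant-temperature sampling phase at several values of T would expose the temperature dependence of the sampled curvature noted above, i.e., the failure of a single quadratic description despite quadratic-like statistics at each fixed temperature.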