Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss function of over-parametrized deep neural networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions, which suggests that the process of minima selection takes place mainly in the relatively low-dimensional subspace associated with the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may essentially live on a low-dimensional manifold. In this paper, we pursue a truly data-driven approach to gain a potentially deeper understanding of the high-dimensional parameter surface, and in particular of the landscape traced out by SGD, by analyzing the data generated by SGD (or any other optimizer, for that matter) in order to discover possible (local) low-dimensional representations of the optimization landscape. As the vehicle for our exploration, we use diffusion maps, introduced by R. Coifman and coauthors.
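To make the idea concrete, the following is a minimal sketch of the classical diffusion-map construction (Gaussian kernel, row normalization, eigendecomposition of the resulting Markov matrix) applied to parameter snapshots collected along an SGD trajectory. It is not the specific pipeline used in this paper; the names `snapshots`, `epsilon`, and `n_coords`, and the median-distance bandwidth heuristic, are illustrative assumptions.

```python
import numpy as np

def diffusion_map(snapshots, epsilon=None, n_coords=2):
    """Compute diffusion-map coordinates for rows of `snapshots` (n_steps x n_params)."""
    # Pairwise squared Euclidean distances via the dot-product identity
    # (avoids materializing an n_steps x n_steps x n_params array).
    sq_norms = np.sum(snapshots ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * snapshots @ snapshots.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off

    # Kernel bandwidth: median heuristic unless supplied (illustrative choice).
    if epsilon is None:
        epsilon = np.median(sq_dists)

    # Gaussian kernel; row normalization yields a Markov transition matrix P.
    K = np.exp(-sq_dists / epsilon)
    P = K / K.sum(axis=1, keepdims=True)

    # Leading nontrivial eigenvectors of P give the diffusion-map coordinates.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return eigvecs[:, 1:n_coords + 1] * eigvals[1:n_coords + 1]

# Usage example with a random stand-in for 200 snapshots of a
# 10,000-dimensional parameter vector recorded during training.
snapshots = np.random.randn(200, 10_000)
coords = diffusion_map(snapshots, n_coords=2)
print(coords.shape)  # (200, 2): each SGD snapshot mapped to two diffusion coordinates
```

If the trajectory indeed concentrates near a low-dimensional manifold, a rapid decay of the eigenvalues of `P` beyond the first few is the signature one would look for in such an analysis.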