We develop information-geometric techniques to analyze the trajectories of deep networks' predictions during training. By examining the underlying high-dimensional probabilistic models, we show that the training process explores an effectively low-dimensional manifold. Networks spanning a wide range of architectures and sizes, trained with different optimization methods, regularization techniques, data augmentation techniques, and weight initializations, lie on the same manifold in the prediction space. Studying this manifold in detail, we find that networks with different architectures follow distinguishable trajectories, whereas the other factors have minimal influence; that larger networks train along a manifold similar to that of smaller networks, only faster; and that networks initialized in very different parts of the prediction space converge to the solution along a similar manifold.
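To make the idea of a low-dimensional manifold in prediction space concrete, the following is a minimal sketch, not the method of this work. It assumes each training checkpoint is represented by its predicted class probabilities on a fixed set of samples, measures the distance between checkpoints with the Bhattacharyya distance (one possible choice; the abstract does not name a distance), and inspects the eigenvalue spectrum of the double-centered distance matrix in the style of multidimensional scaling. The trajectory here is synthetic, and all function names are hypothetical.

```python
import numpy as np


def bhattacharyya_distance(p, q, eps=1e-12):
    """Mean per-sample Bhattacharyya distance between two sets of predictions.

    p, q: arrays of shape (num_samples, num_classes) whose rows sum to 1.
    """
    bc = np.sum(np.sqrt(p * q), axis=1)  # Bhattacharyya coefficient per sample
    return float(np.mean(-np.log(np.clip(bc, eps, 1.0))))


def pairwise_distances(checkpoints):
    """Symmetric matrix of distances between all pairs of checkpoints."""
    n = len(checkpoints)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = bhattacharyya_distance(checkpoints[i], checkpoints[j])
    return d


def embedding_spectrum(d):
    """Eigenvalues of the double-centered distance matrix, largest magnitude first."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    w = -0.5 * centering @ d @ centering
    w = 0.5 * (w + w.T)  # remove floating-point asymmetry before eigvalsh
    eigvals = np.linalg.eigvalsh(w)
    return eigvals[np.argsort(-np.abs(eigvals))]


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# Synthetic stand-in for a training run (an assumption for illustration):
# predictions move from uniform toward confident as t goes from 0 to 1.
rng = np.random.default_rng(0)
target_logits = 5.0 * rng.normal(size=(50, 10))
checkpoints = [softmax(t * target_logits) for t in np.linspace(0.0, 1.0, 20)]

d = pairwise_distances(checkpoints)
spectrum = embedding_spectrum(d)
top3 = np.abs(spectrum[:3]).sum() / np.abs(spectrum).sum()
print(f"fraction of spectrum in top 3 components: {top3:.3f}")
```

If a few eigenvalues dominate the spectrum, the checkpoints are well described by a few coordinates, which is the sense in which a trajectory can be called effectively low-dimensional. With real networks, `checkpoints` would hold the predictions saved at successive training steps, possibly pooled across several runs.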