MLPGradientFlow is a software package for numerically solving the gradient flow differential equation $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$, where $\theta$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes have better accuracy and convergence speed than gradient descent with the Adam optimizer. However, to find fixed points (local and global minima of $\mathcal L$) efficiently and accurately, we find Newton's method and approximations thereof, such as BFGS, preferable. For small networks and data sets, gradients are usually computed faster than in PyTorch, and Hessians are computed at least $5\times$ faster. Additionally, the package features an integrator for a teacher-student setup with bias-free, two-layer networks trained with standard Gaussian input in the limit of infinite data. The code is accessible at https://github.com/jbrea/MLPGradientFlow.jl.
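As a minimal sketch of the underlying idea (not the package's Julia API), the following Python/SciPy snippet integrates the gradient flow $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$ of a small bias-free two-layer network with an adaptive Runge-Kutta scheme. The network size, synthetic data, and helper functions (`unpack`, `grad_loss`) are hypothetical choices made for illustration only.

```python
# Illustrative sketch (not the MLPGradientFlow.jl API): integrate the gradient
# flow  d(theta)/dt = -grad L(theta)  of a tiny bias-free one-hidden-layer
# perceptron with an adaptive Runge-Kutta method (SciPy's RK45).
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
d_in, d_hid, n = 2, 3, 64                     # input dim, hidden units, samples (hypothetical)
X = rng.standard_normal((n, d_in))            # standard Gaussian input
y = np.tanh(X @ rng.standard_normal((d_in, 1))).ravel()  # synthetic teacher targets

def unpack(theta):
    """Split the flat parameter vector into first- and second-layer weights."""
    W1 = theta[:d_in * d_hid].reshape(d_in, d_hid)
    w2 = theta[d_in * d_hid:]
    return W1, w2

def grad_loss(theta):
    """Gradient of the mean squared loss of a bias-free two-layer tanh network."""
    W1, w2 = unpack(theta)
    H = np.tanh(X @ W1)                       # hidden activations, shape (n, d_hid)
    r = H @ w2 - y                            # residuals
    gW2 = H.T @ r / n
    gH = np.outer(r, w2) * (1 - H**2)         # backprop through tanh
    gW1 = X.T @ gH / n
    return np.concatenate([gW1.ravel(), gW2])

theta0 = 0.3 * rng.standard_normal(d_in * d_hid + d_hid)
sol = solve_ivp(lambda t, th: -grad_loss(th), (0.0, 1e3), theta0,
                method="RK45", rtol=1e-8, atol=1e-10)

W1f, w2f = unpack(sol.y[:, -1])
print("final loss:", 0.5 * np.mean((np.tanh(X @ W1f) @ w2f - y) ** 2))
```

Following the trajectory of this ODE with an adaptive integrator tracks the idealized gradient flow far more faithfully than fixed-step gradient descent; to merely locate the flow's fixed points, quasi-Newton methods such as BFGS are typically the faster choice, as stated above.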