Learning Curves for SGD on Structured Features

arXiv comments: Added a new analysis of optimal batch size and learning rate. Provided theoretical learning curves for the case where the test and training measures differ, and applied them to predicting errors for test/train splits on real datasets. Also provided a new bound for non-Gaussian features based on a regularity condition proposed by Varre et al. 2021 (arXiv:2102.03183).

The generalization performance of a machine learning algorithm such as a neural network depends in a non-trivial way on the structure of the data distribution. To analyze the influence of data structure on test loss dynamics, we study an exactly solvable model of stochastic gradient descent (SGD) on mean square loss which predicts test loss when training on features with arbitrary covariance structure. We solve the theory exactly for both Gaussian features and arbitrary features and we show that the simpler Gaussian model accurately predicts test loss of nonlinear random-feature models and deep neural networks trained with SGD on real datasets such as MNIST and CIFAR-10. We show that the optimal batch size at a fixed compute budget is typically small and depends on the feature correlation structure, demonstrating the computational benefits of SGD with small batch sizes. Lastly, we extend our theory to the more usual setting of stochastic gradient descent on a fixed subsampled training set, showing that both training and test error can be accurately predicted in our framework on real data.
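To make the setting in the abstract concrete, here is a minimal sketch of online SGD on a linear model with Gaussian features, compared against a moment recursion for the expected test loss. Everything specific in it is an assumption for illustration and is not taken from the text above: a noise-free linear teacher `w_star`, a diagonal feature covariance with a power-law spectrum `lam`, a per-sample training loss of one half the squared error, and a test loss reported as the plain squared error. Under those assumptions, writing $c_k(t)$ for the expected squared weight error along the $k$-th eigendirection, the Gaussian fourth-moment identity gives $c(t+1) = A\,c(t)$ with $A = (I - \eta\Lambda)^2 + \frac{\eta^2}{m}\left(\Lambda^2 + \lambda\lambda^\top\right)$ and expected test loss $\sum_k \lambda_k c_k(t)$, where $\eta$ is the learning rate and $m$ the batch size. The paper's own notation and normalization may differ.

```python
import numpy as np


def theory_curve(lam, c0, lr, batch, steps):
    """Expected test loss sum_k lam_k * c_k(t) from the Gaussian moment recursion."""
    # A = (I - lr*Lam)^2 + (lr^2 / batch) * (Lam^2 + lam lam^T)
    A = np.diag((1.0 - lr * lam) ** 2 + (lr ** 2 / batch) * lam ** 2)
    A = A + (lr ** 2 / batch) * np.outer(lam, lam)
    c = c0.copy()
    losses = []
    for _ in range(steps):
        losses.append(float(lam @ c))
        c = A @ c
    return np.array(losses)


def simulate_sgd(lam, w_star, lr, batch, steps, rng):
    """One run of online SGD with fresh features x ~ N(0, diag(lam)) at every step."""
    w = np.zeros_like(w_star)
    losses = []
    for _ in range(steps):
        v = w - w_star                       # weight error
        losses.append(float(lam @ v ** 2))   # exact population squared error
        x = rng.normal(size=(batch, lam.size)) * np.sqrt(lam)
        residual = x @ v                     # noise-free targets: y = x @ w_star
        w = w - lr * (x.T @ residual) / batch
    return np.array(losses)


# Hypothetical problem instance (all values chosen for illustration only).
rng = np.random.default_rng(0)
dim = 50
lam = 1.0 / np.arange(1, dim + 1) ** 1.5     # assumed power-law spectrum
w_star = rng.normal(size=dim)
lr, batch, steps, n_runs = 0.3, 4, 200, 200

theory = theory_curve(lam, w_star ** 2, lr, batch, steps)
runs = np.stack([simulate_sgd(lam, w_star, lr, batch, steps, rng)
                 for _ in range(n_runs)])
empirical = runs.mean(axis=0)

max_rel_gap = float(np.max(np.abs(empirical - theory) / theory))
print(f"final theory loss:    {theory[-1]:.4f}")
print(f"final empirical loss: {empirical[-1]:.4f}")
print(f"max relative gap over training: {max_rel_gap:.3f}")
```

The recursion depends on the data only through the spectrum `lam` and the initial per-mode errors, which is the sense in which the feature covariance structure shapes the learning curve in this toy setup. To explore the batch-size question at a fixed compute budget, one could call `theory_curve` with `steps = budget // batch` for several values of `batch` and compare the final losses; the learning rate would need to be retuned for each batch size.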
