The generalization of machine learning models has a complex dependence on the data, the model, and the learning algorithm. We study the train and test performance, as well as the generalization gap given by the mean of their difference over different dataset samples, to understand their ``typical'' behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing that test generalization depends only on the data-averaged parameter distribution and the data-averaged loss. We show that for a large class of model parameter distributions a modified generalization gap is always non-negative. By specializing further to parameter distributions produced by stochastic gradient descent (SGD), along with a few approximations and modeling considerations, we are able to predict how the generalization gap and the model's train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the CIFAR-10 classification task using a ResNet architecture.