While classical in many theoretical settings - and in particular in statistical physics-inspired works - the assumption of Gaussian i.i.d. input data is often perceived as a strong limitation in the context of statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, a.k.a. the perceptron model, with random labels. We argue that there is a large universality class of high-dimensional input data for which we obtain the same minimum training loss as for Gaussian data with the corresponding data covariance. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. On the theoretical side, we prove this universality for an arbitrary mixture of homogeneous Gaussian clouds. Empirically, we show that the universality also holds for a broad range of real datasets.
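The central empirical claim can be checked numerically. The sketch below (not the authors' code) compares the minimum regularized training loss under random labels on a structured, non-Gaussian design against a Gaussian surrogate with matching mean and covariance; the choice of logistic loss, the tanh-transformed data model, and the problem sizes are illustrative assumptions rather than the paper's setup.

```python
# Minimal sketch of the universality claim under assumed choices:
# logistic (perceptron-type) loss, ridge regularization, random labels.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 400, 200, 0.05  # samples, dimension, regularization (illustrative)

def min_training_loss(X, y, lam):
    """Minimum of the ridge-regularized logistic empirical risk."""
    n, d = X.shape
    def risk(w):
        margins = y * (X @ w)
        return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.sum(w**2)
    def grad(w):
        margins = y * (X @ w)
        s = -y / (1.0 + np.exp(margins))  # derivative of log(1 + e^{-m}) w.r.t. w, per sample
        return X.T @ s / n + lam * w
    res = minimize(risk, np.zeros(d), jac=grad, method="L-BFGS-B")
    return res.fun

# Structured, non-Gaussian inputs: uniform noise mixed and passed through tanh
# (a hypothetical data model standing in for "real" correlated data).
Z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))
A = rng.normal(size=(d, d)) / np.sqrt(d)
X_struct = np.tanh(Z @ A)

# Gaussian surrogate with matching mean and covariance.
mu = X_struct.mean(axis=0)
cov = np.cov(X_struct, rowvar=False)
X_gauss = rng.multivariate_normal(mu, cov, size=n)

# Random labels, independent of the inputs.
y = rng.choice([-1.0, 1.0], size=n)

print("structured data   :", min_training_loss(X_struct, y, lam))
print("Gaussian surrogate:", min_training_loss(X_gauss, y, lam))
```

Under the universality claim, the two printed losses should be close in the high-dimensional regime (large n and d at fixed ratio), with agreement improving as the dimensions grow.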