While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high dimensions have the same minimum training loss as Gaussian data with matching covariance. In particular, our theorem covers data generated by an arbitrary mixture of homogeneous Gaussian clouds, as well as by multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice on real datasets with random labels.
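The universality claim can be illustrated with a small numerical sketch. The snippet below is not the paper's proof technique but a hedged toy experiment: it fits a ridge-regularized square-loss classifier to random ±1 labels on two designs with the same (identity) covariance, one Gaussian and one Rademacher, and checks that the minimum training losses nearly coincide in high dimension. The loss function, sample sizes, and regularization strength are illustrative choices, not taken from the paper.

```python
import numpy as np

def min_training_loss(X, y, lam):
    """Minimum of (1/2n)||y - Xw||^2 + (lam/2)||w||^2 over w (ridge closed form)."""
    n, d = X.shape
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    return 0.5 * np.mean((y - X @ w) ** 2) + 0.5 * lam * w @ w

rng = np.random.default_rng(0)
n, d, lam = 2000, 1000, 0.1           # illustrative high-dimensional regime, d/n = 0.5
y = rng.choice([-1.0, 1.0], size=n)   # random labels, independent of the inputs

X_gauss = rng.standard_normal((n, d))         # Gaussian i.i.d. design
X_rad = rng.choice([-1.0, 1.0], size=(n, d))  # non-Gaussian design, same covariance

loss_gauss = min_training_loss(X_gauss, y, lam)
loss_rad = min_training_loss(X_rad, y, lam)
print(loss_gauss, loss_rad)  # the two minima are close for large n, d
```

As the theorem predicts for covariance-matched data, the two training losses concentrate on the same value as n and d grow with fixed ratio; the gap here is a finite-size fluctuation of order n^{-1/2}.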