From the sampling of data to the initialisation of parameters, randomness is ubiquitous in modern Machine Learning practice. Understanding the statistical fluctuations engendered by the different sources of randomness in prediction is therefore key to understanding robust generalisation. In this manuscript we develop a quantitative and rigorous theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features in high dimensions. In particular, we provide a complete description of the asymptotic joint distribution of the empirical risk minimisers for generic convex loss and regularisation in the high-dimensional limit. Our result encompasses a rich set of classification and regression tasks, such as the lazy regime of overparametrised neural networks, or equivalently the random features approximation of kernels. While allowing us to study directly the mitigating effect of ensembling (or bagging) on the bias-variance decomposition of the test error, our analysis also helps disentangle the contribution of statistical fluctuations and the singular role played by the interpolation threshold, which together lie at the root of the "double-descent" phenomenon.
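For concreteness, a minimal sketch of the setting described above, with notation introduced here purely for illustration (the symbols $K$, $\ell$, $r$, $\lambda$, $x_{i,k}$ and $\hat f$ are assumptions, not taken from the manuscript): each of the $K$ learners is a generalised linear model fitted by empirical risk minimisation on its own correlated feature map of the input, and the ensemble averages their predictions,

% assumed notation: \ell is a generic convex loss, r a convex regulariser, \lambda > 0,
% x_{i,k} the k-th (correlated) feature map of sample i, \hat f a fixed link function
\[
  \hat{w}_k = \operatorname*{arg\,min}_{w \in \mathbb{R}^p}
    \sum_{i=1}^{n} \ell\!\left(y_i,\ \tfrac{1}{\sqrt{p}}\, x_{i,k}^{\top} w\right)
    + \lambda\, r(w), \qquad k = 1, \dots, K,
  \qquad
  \hat{y}_K(x) = \frac{1}{K} \sum_{k=1}^{K}
    \hat{f}\!\left(\tfrac{1}{\sqrt{p}}\, x_{k}^{\top} \hat{w}_k\right).
\]
For the square loss, writing $\mathbb{E}\,\hat{y}_K(x)$ for the average of the ensemble prediction over the training randomness at fixed $x$, the test error then splits as
\[
  \mathbb{E}\!\left[\big(y - \hat{y}_K(x)\big)^2\right]
  = \underbrace{\mathbb{E}\!\left[\big(y - \mathbb{E}\,\hat{y}_K(x)\big)^2\right]}_{\text{bias}}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{y}_K(x) - \mathbb{E}\,\hat{y}_K(x)\big)^2\right]}_{\text{variance}},
\]
and it is the variance term that averaging over the $K$ correlated learners mitigates.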