This article provides, through theoretical analysis, an in-depth understanding of the classification performance of the empirical risk minimization framework, in both the ridge-regularized and unregularized cases, when high-dimensional data are considered. Focusing on the fundamental problem of separating a two-class Gaussian mixture, the proposed analysis yields a precise prediction of the classification error for data vectors $\mathbf{x} \in \mathbb{R}^p$ of sufficiently large dimension $p$. This precise error depends on the loss function, the number of training samples, and the statistics of the mixture data model. It is shown to hold beyond the Gaussian distribution under an additional non-sparsity condition on the data statistics. Building upon this quantitative error analysis, we identify the simple square loss as the optimal choice for high-dimensional classification in both the ridge-regularized and unregularized cases, regardless of the number of training samples.
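As a minimal numerical sketch of the setting described above, the snippet below samples a two-class Gaussian mixture and trains a ridge-regularized square-loss classifier, i.e. ridge regression on $\pm 1$ labels, then measures its classification error. The dimensions `p`, `n`, the mean vector `mu`, and the regularization strength `lam` are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem sizes: p features, n training samples. The analysis in the paper
# concerns the regime where both are large; modest values suffice here.
p, n, n_test = 200, 400, 10_000
mu = np.full(p, 1.5 / np.sqrt(p))  # class mean with ||mu|| = O(1) (assumed scaling)

def sample(m):
    """Two-class Gaussian mixture: x = y * mu + z with z ~ N(0, I_p), y = +/-1."""
    y = rng.choice([-1.0, 1.0], size=m)
    X = y[:, None] * mu + rng.standard_normal((m, p))
    return X, y

X, y = sample(n)
X_test, y_test = sample(n_test)

# Ridge-regularized square-loss ERM:
#   w = argmin_w (1/n) * ||X w - y||^2 + lam * ||w||^2,
# which admits the closed-form ridge regression solution below.
lam = 0.1
w = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# Empirical classification error of the linear decision rule sign(w^T x).
err = np.mean(np.sign(X_test @ w) != y_test)
print(f"test classification error: {err:.3f}")
```

Replacing the square loss with another margin-based loss (e.g. the logistic loss, solved iteratively) in the same experiment gives a direct empirical comparison of the kind the quantitative error analysis addresses.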