We study the supervised clustering problem under the two-component anisotropic Gaussian mixture model in high dimensions and in the non-asymptotic setting. We first derive lower and matching upper bounds on the minimax risk of clustering in this framework. We also show that in the high-dimensional regime, the linear discriminant analysis (LDA) classifier is sub-optimal in the minimax sense. Next, we characterize precisely the risk of $\ell_2$-regularized supervised least squares classifiers. We deduce that the interpolating solution may outperform the regularized classifier, under mild assumptions on the covariance structure of the noise. Our analysis also shows that interpolation can be robust to corruption in the covariance of the noise when the signal is aligned with the "clean" part of the covariance, for a suitably defined notion of alignment. To the best of our knowledge, this peculiar phenomenon has not yet been investigated in the rapidly growing literature on interpolation. We conclude that interpolation is not only benign but can also be optimal, and in some cases robust.
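The comparison between the $\ell_2$-regularized classifier and the interpolating solution can be illustrated numerically. Below is a minimal sketch, not the paper's exact setup: it assumes the standard mixture model $x = y\mu + \varepsilon$ with labels $y \in \{-1, +1\}$ and anisotropic noise $\varepsilon \sim \mathcal{N}(0, \Sigma)$, and compares the ridge-regularized least squares classifier at several penalty levels with the minimum-norm interpolator (the $\lambda \to 0$ limit). The dimension, sample size, covariance spectrum, and signal direction are all illustrative choices.

```python
# Sketch (assumed setup): two-component anisotropic Gaussian mixture
# x = y * mu + eps, eps ~ N(0, Sigma), y in {-1, +1}, with d >> n.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                         # high-dimensional regime: d >> n
spectrum = 1.0 / np.arange(1, d + 1)   # assumed decaying covariance eigenvalues
mu = np.zeros(d)
mu[0] = 3.0                            # signal aligned with the top noise direction

def sample(m):
    """Draw m labeled points from the mixture model."""
    y = rng.choice([-1.0, 1.0], size=m)
    eps = rng.standard_normal((m, d)) * np.sqrt(spectrum)
    return y[:, None] * mu + eps, y

X_train, y_train = sample(n)
X_test, y_test = sample(2000)

def ridge_classifier(X, y, lam):
    """Dual-form ridge: w = X^T (X X^T + n*lam*I)^{-1} y.
    At lam = 0 (and d > n) this is the minimum-norm interpolator."""
    G = X @ X.T + len(y) * lam * np.eye(len(y))
    return X.T @ np.linalg.solve(G, y)

for lam in [0.0, 1e-2, 1.0]:
    w = ridge_classifier(X_train, y_train, lam)
    err = np.mean(np.sign(X_test @ w) != y_test)
    print(f"lambda = {lam:g}: test error = {err:.3f}")
```

Under a fast-decaying spectrum with the signal aligned to the clean top direction, the interpolator ($\lambda = 0$) is typically competitive with, or better than, the regularized solutions in such simulations, consistent with the claim above.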