This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and non-vanishing noise. A generalized least squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate the training data. While the direction vector can be estimated consistently, as might be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction, which requires an independent hold-out sample, renders the procedure minimax optimal in many scenarios. The interpolation property of this corrected procedure can be retained, but, surprisingly, whether it holds depends on how the labels are encoded.
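To make the pipeline concrete, the following is a minimal simulation sketch of a procedure of this general shape, under toy assumptions not stated in the abstract: labels encoded as ±1, a two-component isotropic Gaussian mixture with a single mean direction, a minimum-norm ordinary least squares fit standing in for the paper's generalized least squares estimator of the direction, and an intercept placed midway between the projected class means, computed either on the training data (plug-in) or on an independent hold-out sample (corrected). All names, model choices, and dimensions are illustrative assumptions, not the paper's estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-mixture model (an assumption for illustration):
# y = +/-1 with equal probability, x = y * mu + sigma * noise in R^p, p > n.
n, p, sigma = 200, 500, 1.0
mu = np.zeros(p)
mu[0] = 2.0  # single mean direction standing in for the low-dimensional structure


def sample(m):
    """Draw m labeled points from the assumed two-component mixture."""
    y = rng.choice([-1.0, 1.0], size=m)
    x = y[:, None] * mu + sigma * rng.standard_normal((m, p))
    return x, y


X_train, y_train = sample(n)
X_hold, y_hold = sample(n)  # independent hold-out sample for the correction

# Minimum-norm least squares fit of labels on features; in the
# overparameterized regime p > n this interpolates the training labels.
beta = np.linalg.pinv(X_train) @ y_train

# Plug-in intercept: midpoint of the projected class means on the training data.
s_train = X_train @ beta
b_plugin = -(s_train[y_train > 0].mean() + s_train[y_train < 0].mean()) / 2

# Corrected intercept: same midpoint rule, but on the independent hold-out sample.
s_hold = X_hold @ beta
b_holdout = -(s_hold[y_hold > 0].mean() + s_hold[y_hold < 0].mean()) / 2


def accuracy(X, y, intercept):
    """Classify by the sign of the linear score and compare to +/-1 labels."""
    return (np.sign(X @ intercept + 0) if False else (np.sign(X @ beta + intercept) == y)).mean()


X_test, y_test = sample(2000)
print(f"plug-in intercept accuracy:  {accuracy(X_test, y_test, b_plugin):.3f}")
print(f"hold-out intercept accuracy: {accuracy(X_test, y_test, b_holdout):.3f}")
```

In this sketch the ±1 encoding makes the interpolating fit hit the labels exactly on the training set, which is the interpolation property the abstract refers to; with a 0/1 encoding the fitted values and the role of the intercept change, which is the encoding dependence alluded to above.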