We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.
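The score-and-aggregate idea described above can be illustrated with a minimal sketch. It is not the authors' Sharp-SSL implementation, and it rests on several assumptions. The base procedure here is a plug-in estimate from labeled data only, swapped in for the Gaussian EM algorithm, so unlabeled observations are ignored. The per-variable weights are taken to be the diagonal entries of the estimated whitened between-class covariance matrix. "High-scoring" is taken to mean the top fraction of projections by trace. The function names and the parameters `d`, `n_proj`, `top_frac`, `n_select` and `ridge` are hypothetical.

```python
import numpy as np


def whitened_between_class_cov(X, y, ridge=1e-6):
    """Labeled-data plug-in estimate of Sw^{-1/2} Sb Sw^{-1/2}.

    Assumption: this stands in for the base procedure; the abstract
    proposes Gaussian EM, which would also use unlabeled data.
    """
    n, d = X.shape
    overall_mean = X.mean(axis=0)
    Sb = np.zeros((d, d))  # between-class covariance
    Sw = np.zeros((d, d))  # pooled within-class covariance
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - overall_mean)[:, None]
        Sb += (len(Xc) / n) * (diff @ diff.T)
        Sw += (Xc - mean_c).T @ (Xc - mean_c) / n
    Sw += ridge * np.eye(d)  # small ridge keeps Sw invertible
    evals, evecs = np.linalg.eigh(Sw)
    Sw_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return Sw_inv_sqrt @ Sb @ Sw_inv_sqrt


def select_variables(X, y, d=3, n_proj=500, top_frac=0.2, n_select=5, seed=0):
    """Score random axis-aligned projections and aggregate variable weights."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    subsets = np.empty((n_proj, d), dtype=int)
    scores = np.empty(n_proj)
    weights = np.empty((n_proj, d))
    for b in range(n_proj):
        # An axis-aligned projection is just a random subset of d coordinates.
        idx = rng.choice(p, size=d, replace=False)
        M = whitened_between_class_cov(X[:, idx], y)
        subsets[b] = idx
        scores[b] = np.trace(M)   # projection score
        weights[b] = np.diag(M)   # assumed per-variable weights
    # Keep only the highest-scoring projections (assumed aggregation rule).
    n_keep = max(1, int(top_frac * n_proj))
    keep = np.argsort(scores)[-n_keep:]
    importance = np.zeros(p)
    for b in keep:
        importance[subsets[b]] += weights[b]
    selected = np.argsort(importance)[::-1][:n_select]
    return selected, importance
```

Calling `select_variables(X, y)` on an `(n, p)` data matrix `X` with integer class labels `y` returns the indices of the `n_select` highest-weighted variables together with the full importance vector. A low-dimensional classifier could then be fitted on the selected columns for the final class assignment.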