With rapid advances in information technology, massive datasets are collected in all fields of science, such as biology, chemistry, and social science. Useful or meaningful information is often extracted from these data through statistical learning or model fitting. In massive datasets, both the sample size and the number of predictors can be large, in which case conventional methods face computational challenges. Recently, an innovative and effective sampling scheme based on leverage scores via singular value decompositions has been proposed to select rows of a design matrix as a surrogate for the full data in linear regression. Analogously, variable screening can be viewed as selecting columns of the design matrix. However, effective variable selection along this line of thinking remains elusive. In this article, we bridge this gap by proposing a weighted leverage variable screening method that utilizes both the left and right singular vectors of the design matrix. We show theoretically and empirically that the predictors selected using our method consistently include the true predictors, not only for linear models but also for complicated general index models. Extensive simulation studies show that the weighted leverage screening method is both computationally efficient and effective. We also demonstrate its success in identifying carcinoma-related genes using spatial transcriptome data.
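To make the role of the singular vectors concrete, the following is a minimal sketch of ordinary (unweighted) leverage scores computed from a truncated singular value decomposition, assuming NumPy is available. Row scores come from the left singular vectors and column scores from the right singular vectors. The function names `leverage_scores` and `top_columns`, the `rank` parameter, and the random test matrix are illustrative assumptions; the weighted leverage score proposed in the article is not specified in this abstract and is not reproduced here.

```python
import numpy as np


def leverage_scores(X, rank=None):
    """Return (row_scores, col_scores) from a rank-truncated thin SVD of X.

    Illustrative helper, not the article's weighted score.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    if rank is None:
        # Keep only directions with non-negligible singular values.
        rank = int(np.sum(s > s[0] * 1e-10))
    U_r = U[:, :rank]        # left singular vectors, one row per sample
    V_r = Vt[:rank, :].T     # right singular vectors, one row per predictor
    row_scores = np.sum(U_r ** 2, axis=1)   # squared row norms of U_r
    col_scores = np.sum(V_r ** 2, axis=1)   # squared row norms of V_r
    return row_scores, col_scores


def top_columns(X, k, rank=None):
    """Indices of the k columns with the largest unweighted leverage."""
    _, col_scores = leverage_scores(X, rank)
    return np.argsort(col_scores)[::-1][:k]


# Hypothetical data: 50 samples, 200 predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
row_scores, col_scores = leverage_scores(X, rank=5)
keep = top_columns(X, k=10, rank=5)
```

Because the retained singular vectors are orthonormal, each set of scores sums to the chosen rank, which gives a quick sanity check. Ranking columns by these scores alone ignores the response variable, so it should be read only as an illustration of the row-versus-column analogy, not as a screening procedure with the guarantees described above.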