This work is devoted to the finite-sample prediction risk analysis of a class of linear predictors of a response $Y\in \mathbb{R}$ from a high-dimensional random vector $X\in \mathbb{R}^p$ when $(X,Y)$ follows a latent factor regression model generated by an unobservable latent vector $Z$ of dimension less than $p$. Our primary contribution is to establish finite-sample risk bounds for prediction with the ubiquitous Principal Component Regression (PCR) method, under the factor regression model, with the number of principal components adaptively selected from the data, a form of theoretical guarantee that is surprisingly lacking from the PCR literature. To accomplish this, we prove a master theorem that establishes a risk bound for a large class of predictors, including the PCR predictor as a special case. This approach has the benefit of providing a unified framework for the analysis of a wide range of linear prediction methods under the factor regression setting. In particular, we use our main theorem to recover known risk bounds for two other predictors: the minimum-norm interpolating predictor, which has received renewed attention in the past two years, and a prediction method tailored to a subclass of factor regression models with identifiable parameters. This model-tailored method can be interpreted as prediction via clusters with latent centers. To address the problem of selecting among a set of candidate predictors, we analyze a simple model selection procedure based on data-splitting, and provide an oracle inequality under the factor model showing that the performance of the selected predictor is close to that of the optimal candidate. We conclude with a detailed simulation study to support and complement our theoretical results.
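As a rough illustration of the setting described above, the following sketch simulates a latent factor regression model, fits PCR for several candidate numbers of components, and picks one by data-splitting. It is a minimal sketch under stated assumptions, not the procedure analyzed in the paper: the simulated model $X = AZ + W$, $Y = Z^\top\beta + \varepsilon$, the dimensions, the noise levels, the candidate list, and the helper names `pcr_fit` and `select_by_data_splitting` are all hypothetical choices made for this example.

```python
import numpy as np


def pcr_fit(X, y, k):
    """Return PCR coefficients in R^p using the top-k principal directions of X.

    Assumes the columns of X and the response y are (approximately) mean zero,
    so no centering is applied.
    """
    # Rows of Vt are the right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T                      # (p, k) projection onto top-k directions
    scores = X @ Vk                    # (n, k) principal component scores
    gamma, *_ = np.linalg.lstsq(scores, y, rcond=None)
    return Vk @ gamma                  # map back to a coefficient vector in R^p


def select_by_data_splitting(X, y, candidates, train_frac=0.5, seed=0):
    """Fit each candidate on one part of the data, compare on the held-out part."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    train, valid = perm[:n_train], perm[n_train:]

    best_k, best_beta, best_err = None, None, np.inf
    for k in candidates:
        beta_hat = pcr_fit(X[train], y[train], k)
        err = np.mean((y[valid] - X[valid] @ beta_hat) ** 2)
        if err < best_err:
            best_k, best_beta, best_err = k, beta_hat, err
    return best_k, best_beta, best_err


# Hypothetical simulation: n samples, p features, K latent factors (K < p).
rng = np.random.default_rng(1)
n, p, K = 200, 50, 3
A = rng.normal(size=(p, K))            # loading matrix (assumed)
beta = np.ones(K)                      # latent regression coefficients (assumed)
Z = rng.normal(size=(n, K))            # unobserved latent factors
X = Z @ A.T + 0.5 * rng.normal(size=(n, p))
y = Z @ beta + 0.5 * rng.normal(size=n)

candidates = [1, 2, 3, 5, 10]
best_k, best_beta, best_err = select_by_data_splitting(X, y, candidates)
print("selected number of components:", best_k)
print("held-out mean squared error:", best_err)
```

The held-out error here only ranks the candidates; the oracle inequality mentioned above concerns how close the selected predictor's risk is to that of the best candidate, which this sketch does not verify.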