There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression. Examples include an ordering of the variables given by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of the variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme that fits a sequence of models indicated by the ordering. We show that the computational cost of fitting all models when ridge regression is used is no more than that of a single ridge regression fit, and describe a strategy for Lasso regression that reuses previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the estimator, selected from among $M$ competing estimators, that performs best on a test set in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred relative to the unknown best estimator. We demonstrate the effectiveness of our approach in settings with missing or corrupted data and in time series settings. An R package is available on GitHub.
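To make the proposed scheme concrete, the following is a minimal sketch of fitting the nested sequence of ridge models induced by a variable ordering and selecting among them on held-out data. This is a naive reference implementation that refits each of the $p$ nested models from scratch; the paper's contribution is an algorithm that fits the whole sequence at the cost of a single ridge fit, which is not reproduced here. All function names (`ordered_ridge_path`, `select_by_validation`) are illustrative, not from the accompanying R package.

```python
import numpy as np

def ordered_ridge_path(X, y, ordering, lam=1.0):
    """Fit ridge regression on the nested predictor sets
    X[:, ordering[:k]] for k = 1, ..., p.

    Naive version: p separate fits. (The paper's fast algorithm
    achieves the cost of a single fit; this sketch does not.)
    Returns a list of (index_set, coefficient_vector) pairs.
    """
    n, p = X.shape
    path = []
    for k in range(1, p + 1):
        idx = ordering[:k]
        Xk = X[:, idx]
        # Ridge solution: (Xk' Xk + lam I)^{-1} Xk' y
        beta = np.linalg.solve(Xk.T @ Xk + lam * np.eye(k), Xk.T @ y)
        path.append((idx, beta))
    return path

def select_by_validation(path, X_val, y_val):
    """Return the index of the model in the path with the
    smallest mean squared error on the validation set."""
    errs = [np.mean((y_val - X_val[:, idx] @ beta) ** 2)
            for idx, beta in path]
    return int(np.argmin(errs))
```

In practice the held-out evaluation would be replaced by cross-validation as in the abstract; the $\log M$ result then bounds the excess risk of the selected model over the best one in the sequence.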