Variable selection in linear regression models: choosing the best subset is not always the best choice

Variable selection in linear regression settings is a much-discussed problem. Best subset selection (BSS) is often considered the intuitive 'gold standard', its use limited only by the NP-hardness of the problem. Alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become the methods of choice in high-dimensional settings. A recent proposal formulates BSS as a mixed-integer optimization problem, which makes much larger problems feasible in reasonable computation time. We present an extensive neutral comparison of the variable selection performance, in linear regression, of BSS against forward stepwise selection (FSS), Lasso and Enet. The simulation study covers a wide range of settings that are challenging with regard to dimensionality (the number of observations and variables), signal-to-noise ratio and correlation between predictors. As the main measure of performance, we used the best possible F1-score for each method, which ensures a fair comparison irrespective of any criterion for choosing the tuning parameters; the results were confirmed by alternative performance measures. Somewhat surprisingly, even in low-dimensional settings BSS reliably outperformed the other methods only when the signal-to-noise ratio was high and the variables were (nearly) uncorrelated. Further, the performance of FSS was nearly identical to that of BSS. Our results shed new light on the usual presumption that BSS is, in principle, the best choice for variable selection. Especially for correlated variables, alternatives like Enet are faster and appear to perform better in practical settings.
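
To make the main performance measure concrete, the following is a minimal sketch of a "best possible F1-score" computation, assuming each method yields a path of candidate variable sets (one per tuning-parameter value or subset size) and that the true support is known, as it is in a simulation. The function names `f1_support` and `best_f1` and the example path are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from typing import Iterable, Sequence


def f1_support(selected: Iterable[int], truth: Iterable[int]) -> float:
    """F1-score of a selected variable set against the true support."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)  # correctly selected variables
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def best_f1(path: Sequence[Iterable[int]], truth: Iterable[int]) -> float:
    """Best F1 over all candidate models on a method's path.

    This mimics an oracle choice of the tuning parameter, so the result
    does not depend on any particular tuning criterion.
    """
    truth = set(truth)
    return max((f1_support(s, truth) for s in path), default=0.0)


# Hypothetical example: variables 0, 1 and 2 are truly active, and a
# stepwise-style method adds one variable at a time (5 is a false positive).
truth = {0, 1, 2}
example_path = [[0], [0, 1], [0, 1, 5], [0, 1, 5, 2]]
print(best_f1(example_path, truth))
```

Taking the maximum along the path separates the quality of the candidate models a method can produce from the quality of the rule used to pick one of them, which is the sense in which the comparison is independent of the tuning criterion.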
