We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar reconstructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving variable selection stability, efficiency, and accuracy. We prove that: (i) with high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample sizes. We also demonstrate that: (i) solar yields, with less than $1/3$ of the lasso computation load, substantial improvements over lasso in terms of the sparsity (64--84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selection of redundant variables and rejection of informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserve residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) for given computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than the theoretical maximum speedup for parallelized bootstrap lasso (confirmed by Amdahl's law).
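For reference, the "theoretical maximum speedup" in (v) can be read against the standard statement of Amdahl's law. The symbols below are notation assumed here for illustration and do not come from the paper: $f$ is the fraction of the bootstrap lasso workload that can be parallelized, $m$ is the number of parallel workers, and $S(m)$ is the resulting speedup over serial execution.

\[
S(m) = \frac{1}{(1 - f) + f/m}, \qquad \lim_{m \to \infty} S(m) = \frac{1}{1 - f}.
\]

Under this bound, parallelized bootstrap lasso can run at most $1/(1-f)$ times faster than its serial version, however many workers are available; this ceiling appears to be the benchmark against which the 98\% figure is reported.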