We propose a new algorithm for variable selection in high-dimensional data, called subsample-ordered least-angle regression (solar). Solar relies on the average $L_0$ solution path computed across subsamples and alleviates several known high-dimensional issues with lasso and least-angle regression. We illustrate in simulations that, with the same computational load, solar yields substantial improvements over lasso in terms of the sparsity (a 37-64\% reduction in the average number of selected variables), stability, and accuracy of variable selection. Moreover, solar supplemented with the hold-out average (an adaptation of classical post-OLS tests) purges almost all of the redundant variables while retaining all of the informative variables. Using simulations and real-world data, we also illustrate numerically that sparse solar variable selection is robust to complicated dependence structures and harsh settings of the irrepresentable condition. Finally, replacing lasso with solar in an ensemble system (e.g., the bootstrap ensemble) significantly reduces the computational load of the ensemble (at least 96\% fewer subsample repetitions) and improves selection sparsity. We provide a Python parallel computing package for solar (solarpy) in the supplementary file and at https://github.com/isaac2math/solar.
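To make the idea of averaging a solution path across subsamples concrete, the following is a minimal sketch, not the solarpy implementation and not necessarily the exact procedure used in the paper. It runs least-angle regression (via scikit-learn's `lars_path`) on random subsamples, records the order in which variables enter the path, and averages a simple rank-based score. The function name `average_entry_scores`, the scoring rule `(p - position) / p`, the subsample settings, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import lars_path


def average_entry_scores(X, y, n_subsamples=10, subsample_frac=0.5, seed=0):
    """Average a rank score of LAR entry order over random subsamples.

    Hypothetical helper: the scoring rule is an illustrative assumption,
    not the definition used by solar or solarpy.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(subsample_frac * n)
    scores = np.zeros(p)
    for _ in range(n_subsamples):
        # Draw a subsample of rows without replacement.
        idx = rng.choice(n, size=m, replace=False)
        Xs, ys = X[idx], y[idx]
        # lars_path fits no intercept, so standardize X and center y.
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
        ys = ys - ys.mean()
        # With method="lar", `active` lists variables in order of entry.
        _, active, _ = lars_path(Xs, ys, method="lar")
        # Earlier entry gives a higher score; variables that never enter get 0.
        for position, j in enumerate(active):
            scores[j] += (p - position) / p
    return scores / n_subsamples


# Synthetic data (assumed): 3 informative variables out of 20.
rng = np.random.default_rng(42)
n, p = 200, 20
X = rng.standard_normal((n, p))
y = 5.0 * X[:, 0] + 4.0 * X[:, 1] - 4.0 * X[:, 2] + 0.5 * rng.standard_normal(n)

scores = average_entry_scores(X, y)
ranking = np.argsort(-scores)  # variables sorted from highest to lowest score
print(ranking[:3])
```

In this sketch, variables that consistently enter the path early across subsamples receive an average score close to 1, while variables that enter late or inconsistently receive lower scores, so the averaged ranking is less sensitive to any single subsample than one LAR path would be.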