Selective Inference (SI) has been actively studied in the past few years as a way to conduct inference on the features of linear models that are adaptively selected by feature selection methods such as Lasso. The basic idea of SI is to make inference conditional on the selection event. The main limitation of the original SI approach for Lasso is that inference is conditional not only on the selected features but also on their signs, which causes a loss of power due to over-conditioning. Although this limitation can be circumvented by taking the union of such selection events over all possible combinations of signs, doing so is feasible only when the number of selected features is sufficiently small. To address this computational bottleneck, we propose a parametric programming-based method that can conduct SI without conditioning on signs even when there are thousands of active features. The main idea is to compute the continuum path of Lasso solutions in the direction of a test statistic and, by following this solution path, to identify the subset of the data space that corresponds to the feature selection event. The proposed method not only avoids the aforementioned computational bottleneck but also improves the performance and practicality of SI for Lasso in various respects. We conduct several experiments to demonstrate the effectiveness and efficiency of the proposed method.
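To make the effect of dropping the sign condition concrete, the following is a minimal sketch, not the method proposed here. It assumes the standard SI setting in which the test statistic follows a normal distribution under the null and the selection event restricts it to a set of intervals along the test-statistic direction. Conditioning on signs corresponds to a single interval, while removing the sign condition corresponds to a union of intervals. The function names, the interval endpoints, and the observed value are all hypothetical and chosen only for illustration.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_mass(intervals, sigma, upper=None):
    """N(0, sigma^2) mass on a union of intervals, optionally capped at `upper`."""
    total = 0.0
    for lo, hi in intervals:
        if upper is not None:
            hi = min(hi, upper)
        if hi > lo:
            total += norm_cdf(hi / sigma) - norm_cdf(lo / sigma)
    return total


def selective_p_value(z_obs, intervals, sigma=1.0):
    """Two-sided p-value of z_obs under N(0, sigma^2) truncated to `intervals`."""
    denom = truncated_mass(intervals, sigma)
    cdf = truncated_mass(intervals, sigma, upper=z_obs) / denom
    return 2.0 * min(cdf, 1.0 - cdf)


# Hypothetical numbers: the observed statistic and the truncation regions
# are made up for illustration, not taken from any experiment.
z_obs = 2.5
sign_conditioned = [(2.0, 4.0)]                  # one interval: features and signs
sign_free = [(-math.inf, -3.0), (2.0, 4.0), (5.0, math.inf)]  # union over signs

p_sign = selective_p_value(z_obs, sign_conditioned)
p_union = selective_p_value(z_obs, sign_free)
print(p_sign, p_union)
```

In this sketch the intervals are given by hand. Finding them for a real Lasso fit is exactly the part that becomes expensive when enumerating sign combinations, and it is the step that following the solution path along the test-statistic direction is meant to handle.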