The early solution path, which tracks the first few variables that enter the model of a selection procedure, is of profound importance to scientific discovery. In practice, it is often statistically infeasible to identify all the important features with no false discovery, let alone to bear the daunting expense of experiments that test their significance. This practical limitation calls for statistical guarantees on the early discoveries of a model selector, to help navigate scientific exploration in the sea of big data. In this paper, we focus on the early solution path of best subset selection (BSS), where the sparsity constraint is set lower than the true sparsity. Under a sparse high-dimensional linear model, we establish a sufficient and (near) necessary condition for BSS to achieve sure early selection, or equivalently, zero false discovery throughout its entire early path. Essentially, this condition boils down to a lower bound on the minimum projected signal margin, which characterizes the fundamental gap in signal capture between sure selection models and those with spurious discoveries. Defined through projection operators, this margin is independent of the restricted eigenvalues of the design, suggesting the robustness of BSS against collinearity. On the numerical side, we use CoSaMP (Compressive Sampling Matching Pursuit) to approximate the BSS solutions, and we show that the resulting early path exhibits a much lower false discovery rate (FDR) than LASSO, MCP and SCAD, especially in the presence of highly correlated design. Finally, we apply CoSaMP to perform preliminary feature screening for the knockoff filter to enhance its power.
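To make the numerical idea concrete, the following is a minimal sketch of a standard CoSaMP iteration used to trace an early path, that is, the supports selected at sparsity levels below the true sparsity. It is an illustration under stated assumptions, not the implementation used in the paper: the function names `cosamp` and `early_path`, the stopping rule, and the default parameters are all hypothetical choices made for this example.

```python
import numpy as np


def cosamp(X, y, s, max_iter=50, tol=1e-8):
    """Approximate the best s-sparse least-squares fit with a basic CoSaMP loop.

    Illustrative sketch only; names, defaults and stopping rule are assumptions.
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()
    for _ in range(max_iter):
        # Correlate the current residual with every column of the design.
        proxy = X.T @ residual
        # Candidate set: the 2s columns most correlated with the residual.
        candidates = np.argsort(np.abs(proxy))[-min(2 * s, p):]
        # Merge the candidates with the current support.
        merged = np.union1d(candidates, np.flatnonzero(beta))
        # Least-squares fit restricted to the merged support.
        coef, *_ = np.linalg.lstsq(X[:, merged], y, rcond=None)
        # Prune back to the s largest coefficients in magnitude.
        keep = np.argsort(np.abs(coef))[-s:]
        beta_new = np.zeros(p)
        beta_new[merged[keep]] = coef[keep]
        residual = y - X @ beta_new
        # Stop once the iterate no longer changes (hypothetical rule).
        converged = np.linalg.norm(beta_new - beta) <= tol * max(
            1.0, np.linalg.norm(beta_new)
        )
        beta = beta_new
        if converged:
            break
    return beta


def early_path(X, y, max_sparsity):
    """Return the selected support for each sparsity level 1..max_sparsity."""
    return [np.flatnonzero(cosamp(X, y, s)) for s in range(1, max_sparsity + 1)]


if __name__ == "__main__":
    # Synthetic example: 5 true signals, path traced only up to sparsity 3.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    beta_true = np.zeros(50)
    beta_true[:5] = [5.0, 4.0, 3.0, 2.0, 1.5]
    y = X @ beta_true + 0.1 * rng.standard_normal(200)
    for s, support in enumerate(early_path(X, y, 3), start=1):
        print(s, support.tolist())
```

Sure early selection corresponds to every support on this path being a subset of the true support. Comparing the supports against the known nonzero indices of `beta_true` in a simulation gives an empirical false discovery count for each sparsity level.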