The early solution path, which tracks the first few variables that enter the model of a selection procedure, is of profound importance to scientific discovery. In practice, it is often statistically hopeless to identify all the important features without any false discovery, let alone to bear the intimidating expense of the experiments needed to test their significance. Such realistic limitations call for statistical guarantees on the early discoveries of a model selector. In this paper, we focus on the early solution path of best subset selection (BSS), where the sparsity constraint is set lower than the true sparsity. Under a sparse high-dimensional linear model, we establish a sufficient and (nearly) necessary condition for BSS to achieve sure early selection, or equivalently, zero false discovery throughout its early path. Essentially, this condition boils down to a lower bound on the minimum projected signal margin, which characterizes the gap in captured signal strength between sure selection models and those with spurious discoveries. Defined through projection operators, this margin is independent of the restricted eigenvalues of the design, suggesting the robustness of BSS against collinearity. Moreover, our model selection guarantee tolerates reasonable optimization error and thus applies to near best subsets. Finally, to overcome the computational hurdle of BSS in high dimensions, we propose the "screen then select" (STS) strategy to reduce the dimension before applying BSS. Our numerical experiments show that the resulting early path exhibits a much lower false discovery rate (FDR) than LASSO, MCP and SCAD, especially when the design is highly correlated. We also investigate the early paths of iterative hard thresholding algorithms, which are greedy computational surrogates for BSS and which yield FDR comparable to that of our STS procedure.
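To make the "screen then select" idea concrete, the following is a minimal sketch, not the procedure of the paper. The abstract does not say which screening rule is used, so ranking features by absolute marginal correlation is an assumption here, as are the function names `best_subset` and `screen_then_select` and the parameter `k` (the number of features kept after screening). The selection step is an exhaustive best subset search of size `s`, intended to be smaller than the true sparsity.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, s):
    """Exhaustive BSS: return the size-s column subset with the smallest RSS."""
    best_rss, best_cols = np.inf, None
    for cols in combinations(range(X.shape[1]), s):
        Xs = X[:, list(cols)]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ coef
        rss = float(resid @ resid)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols


def screen_then_select(X, y, s, k):
    """Keep k features by a screening rule, then run BSS of size s on them.

    ASSUMPTION: screening by absolute marginal correlation; the source
    abstract does not specify the screening step.
    """
    Xc = X - X.mean(axis=0)  # center so no intercept is needed
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    kept = np.sort(np.argsort(score)[::-1][:k])  # indices of the top-k scores
    chosen = best_subset(Xc[:, kept], yc, s)
    return tuple(int(kept[j]) for j in chosen)  # map back to original columns


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 50
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:4] = 3.0  # true support is {0, 1, 2, 3}
    y = X @ beta + 0.5 * rng.standard_normal(n)
    # Early path: ask for s = 2 variables, fewer than the true sparsity 4.
    print(screen_then_select(X, y, s=2, k=10))
```

The exhaustive search costs on the order of `C(k, s)` least-squares fits, which is why the dimension is reduced first; this sketch is only practical for small `k` and `s`.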