The Terminating-Knockoff Filter: Fast High-Dimensional Variable Selection with False Discovery Rate Control

From arXiv. Comments: 44 pages, 13 figures, 2 tables; replacement of the first version (changes: new template, minor reformulations). Submitted to the Annals of Statistics. Previously submitted to the Journal of Machine Learning Research, whose editors-in-chief promptly suggested, without a review, that the paper would fit much better into a statistics journal.

We propose the Terminating-Knockoff (T-Knock) filter, a fast variable selection method for high-dimensional data. The T-Knock filter controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated knockoff predictors. We provide a finite-sample proof, based on martingale theory, of the FDR control property. Numerical simulations show that the FDR is controlled at the target level while high power is retained. We prove that, under mild conditions, the knockoffs can be sampled from any univariate distribution. We derive the computational complexity of the proposed method and demonstrate via numerical simulations that its sequential computation time is multiple orders of magnitude lower than that of the strongest benchmark methods in sparse high-dimensional settings. The T-Knock filter outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its computation time is more than two orders of magnitude lower than that of the strongest benchmark methods.
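The idea of fusing early-terminated random experiments can be illustrated with a toy sketch. This is not the authors' algorithm: the function names `early_terminated_experiment` and `toy_terminating_knockoff_filter`, the parameters `K` (number of experiments), `T` (knockoffs allowed to enter before termination), and `vote` (selection-frequency threshold) are hypothetical, the knockoffs are assumed to be i.i.d. standard normal, and a simple greedy forward selection stands in for whatever solver the paper uses. The sketch also does not calibrate `T` or `vote`, so it carries no FDR guarantee.

```python
import numpy as np


def early_terminated_experiment(X, y, n_knockoffs, T, rng):
    """One random experiment: greedy forward selection on [X, knockoffs],
    terminated as soon as T knockoffs have entered the model."""
    n, p = X.shape
    # Assumption: knockoffs are i.i.d. standard normal columns.
    knockoffs = rng.standard_normal((n, n_knockoffs))
    Z = np.hstack([X, knockoffs])
    Z = Z - Z.mean(axis=0)
    Z = Z / np.linalg.norm(Z, axis=0)
    y_centered = y - y.mean()

    residual = y_centered.copy()
    in_model = np.zeros(Z.shape[1], dtype=bool)
    active = []
    knockoffs_in = 0
    max_steps = min(n - 1, Z.shape[1])

    while knockoffs_in < T and len(active) < max_steps:
        # Pick the column most correlated with the current residual.
        score = np.abs(Z.T @ residual)
        score[in_model] = -np.inf
        j = int(np.argmax(score))
        in_model[j] = True
        active.append(j)
        if j >= p:  # indices >= p are knockoffs
            knockoffs_in += 1
        # Refit by least squares on the active set and update the residual.
        coef, *_ = np.linalg.lstsq(Z[:, active], y_centered, rcond=None)
        residual = y_centered - Z[:, active] @ coef

    # Report only original predictors that entered before termination.
    return np.array([j for j in active if j < p], dtype=int)


def toy_terminating_knockoff_filter(X, y, K=20, T=1, vote=0.75,
                                    n_knockoffs=None, seed=0):
    """Fuse K experiments: keep variables selected in more than `vote` of them."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_knockoffs = p if n_knockoffs is None else n_knockoffs
    counts = np.zeros(p)
    for _ in range(K):
        selected = early_terminated_experiment(X, y, n_knockoffs, T, rng)
        counts[selected] += 1
    return np.flatnonzero(counts / K > vote)
```

Calling `toy_terminating_knockoff_filter(X, y)` on an `n × p` design matrix and a response vector returns the indices of the original predictors that survive the vote. Early termination is what keeps each experiment cheap: the forward pass stops after a handful of steps in sparse settings instead of computing a full solution path.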
