We propose the Terminating-Knockoff (T-Knock) filter, a fast variable selection method for high-dimensional data. The T-Knock filter controls a user-defined target false discovery rate (FDR) while maximizing the number of selected true positives. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original data and multiple sets of randomly generated knockoff variables. A finite-sample proof of the FDR control property, based on martingale theory, is provided. Numerical simulations show that the FDR is controlled at the target level while achieving high power. We prove under mild conditions that the knockoffs can be sampled from any univariate distribution. We derive the computational complexity of the proposed method and demonstrate via numerical simulations that its sequential computation time is multiple orders of magnitude lower than that of the strongest benchmark methods in sparse high-dimensional settings. The T-Knock filter outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its computation time is more than two orders of magnitude lower than that of the strongest benchmark methods.
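To make the idea of fusing early-terminated random experiments concrete, the following is a minimal sketch and not the T-Knock filter itself. Several choices are assumptions made only for illustration: a greedy forward selection (orthogonal-matching-pursuit style) stands in for the solver, knockoffs are drawn as i.i.d. standard normal columns, each experiment stops once `T` knockoffs have entered, and the experiments are fused by a simple majority vote (`vote = 0.5`). The names `forward_select_until_dummies` and `t_knock_sketch` are hypothetical. Because `T` and `vote` are fixed by hand rather than calibrated, this sketch carries no FDR guarantee.

```python
import numpy as np


def forward_select_until_dummies(X_aug, y, p, T):
    """Greedy forward selection on [X, knockoffs]; stop once T knockoff
    columns (index >= p) have entered. Returns the original-variable
    indices that entered before termination. (Illustrative stand-in solver.)"""
    n, m = X_aug.shape
    active = []
    residual = y.copy()
    dummies_in = 0
    while dummies_in < T and len(active) < min(n, m):
        scores = np.abs(X_aug.T @ residual)
        if active:
            scores[active] = -np.inf  # never re-pick an active column
        j = int(np.argmax(scores))
        active.append(j)
        if j >= p:
            dummies_in += 1
        # Refit least squares on the active set and update the residual.
        coef, *_ = np.linalg.lstsq(X_aug[:, active], y, rcond=None)
        residual = y - X_aug[:, active] @ coef
    return [j for j in active if j < p]


def t_knock_sketch(X, y, K=20, L=None, T=1, vote=0.5, seed=0):
    """Run K random experiments with fresh knockoffs and fuse them by voting.
    K, L, T and vote are hand-picked assumptions, not calibrated values."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = p if L is None else L
    counts = np.zeros(p)
    for _ in range(K):
        knockoffs = rng.standard_normal((n, L))  # assumed: i.i.d. N(0, 1)
        X_aug = np.hstack([X, knockoffs])
        for j in forward_select_until_dummies(X_aug, y, p, T):
            counts[j] += 1
    rel_occ = counts / K  # fraction of experiments selecting each variable
    selected = np.flatnonzero(rel_occ > vote)
    return selected, rel_occ


# Toy data: 3 strong true variables among p = 30, n = 300 samples.
rng = np.random.default_rng(1)
n, p = 300, 30
support = np.array([0, 1, 2])
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[support] = 5.0
y = X @ beta + rng.standard_normal(n)

selected, rel_occ = t_knock_sketch(X, y)
# False discovery proportion and true positive proportion of this one run.
fdp = np.sum(~np.isin(selected, support)) / max(len(selected), 1)
tpp = np.sum(np.isin(support, selected)) / len(support)
print("selected:", selected.tolist(), "FDP:", fdp, "TPP:", tpp)
```

The FDR is the expectation of the false discovery proportion computed above, so a single run only gives one realization of it; assessing control at a target level would require averaging over many simulated data sets.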