We propose the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite-sample proof of the FDR control property, based on martingale theory, is provided. Numerical simulations confirm that the FDR is controlled at the target level while high power is achieved. We prove that, under mild conditions, the dummies can be sampled from any univariate probability distribution with finite expectation and variance. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. The open-source R package TRexSelector, which contains the implementation of the T-Rex selector, is available on CRAN.
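To make the idea of fusing early-terminated random experiments concrete, the following is a minimal, simplified sketch and not the T-Rex selector itself. It assumes that each experiment ranks original and dummy predictors together and stops once a fixed number of dummies has appeared, and that the fusion step keeps variables whose relative occurrence exceeds a voting threshold. The ranking by absolute marginal correlation is a stand-in chosen for brevity. The fixed `max_dummies` and `threshold` values are illustrative and do not carry the FDR guarantee described above. All function and parameter names are hypothetical and are unrelated to the TRexSelector package API.

```python
import random


def correlation(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def random_experiment(columns, y, num_dummies, max_dummies, rng):
    """Run one early-terminated experiment; return candidate original indices.

    Assumption: ranking by |marginal correlation| stands in for a proper
    selection path. The walk stops when `max_dummies` dummies have been seen.
    """
    n = len(y)
    # Fresh standard normal dummies for every experiment.
    dummies = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(num_dummies)]
    scored = [(abs(correlation(col, y)), j, False) for j, col in enumerate(columns)]
    scored += [(abs(correlation(col, y)), -1, True) for col in dummies]
    scored.sort(key=lambda item: item[0], reverse=True)

    candidates, dummies_seen = [], 0
    for _, index, is_dummy in scored:
        if is_dummy:
            dummies_seen += 1
            if dummies_seen >= max_dummies:
                break  # early termination
        else:
            candidates.append(index)
    return candidates


def fuse_experiments(columns, y, num_experiments=20, max_dummies=1,
                     threshold=0.75, seed=0):
    """Fuse experiments by relative occurrence (illustrative fixed threshold)."""
    rng = random.Random(seed)
    p = len(columns)
    counts = [0] * p
    for _ in range(num_experiments):
        for j in random_experiment(columns, y, p, max_dummies, rng):
            counts[j] += 1
    frequencies = [c / num_experiments for c in counts]
    selected = [j for j, f in enumerate(frequencies) if f > threshold]
    return selected, frequencies


def false_discovery_proportion(selected, true_support):
    """Share of selected variables that are not in the true support."""
    false_hits = sum(1 for j in selected if j not in true_support)
    return false_hits / max(1, len(selected))


# Toy data: 100 samples, 20 predictors, only predictors 0 and 1 are active.
data_rng = random.Random(1)
n, p = 100, 20
columns = [[data_rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [5 * columns[0][i] + 5 * columns[1][i] + data_rng.gauss(0.0, 1.0)
     for i in range(n)]

selected, frequencies = fuse_experiments(columns, y)
print("selected:", selected)
print("FDP:", false_discovery_proportion(selected, {0, 1}))
```

The FDR is the expectation of the false discovery proportion computed by `false_discovery_proportion`, so a single run of this toy example only yields one realization of it. In the method described above, the termination point and the voting level are not fixed by hand as they are here; they are chosen so that the target FDR is provably controlled.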