We propose the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite-sample proof of the FDR control property, based on martingale theory, is provided. Numerical simulations confirm that the FDR is controlled at the target level while achieving high power. We prove that, under mild conditions, the dummies can be sampled from any univariate probability distribution with finite expectation and variance. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. The open-source R package TRexSelector, which contains the implementation of the T-Rex selector, is available on CRAN.
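The idea of fusing early-terminated random experiments can be illustrated with a minimal Python sketch. It is not the authors' algorithm and carries no FDR guarantee: the number of dummies, the number of dummies allowed before termination, and the voting threshold are fixed by hand here, whereas the method described above calibrates such quantities to meet the target FDR. The forward selector is a simple greedy matching-pursuit step chosen for brevity (an assumption, since the abstract does not name one), and all function and parameter names (`run_experiment`, `select_variables`, `make_toy_data`, `vote_threshold`, and so on) are hypothetical and unrelated to the TRexSelector package API.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def standardize(col):
    """Center a column and scale it to unit Euclidean norm."""
    mean = sum(col) / len(col)
    centered = [c - mean for c in col]
    norm = dot(centered, centered) ** 0.5
    return [c / norm for c in centered] if norm > 0 else centered


def run_experiment(columns, y, num_dummies, max_dummies, rng):
    """One random experiment: append fresh Gaussian dummies, add variables
    greedily, and terminate once `max_dummies` dummies have been included."""
    p, n = len(columns), len(y)
    dummies = [standardize([rng.gauss(0.0, 1.0) for _ in range(n)])
               for _ in range(num_dummies)]
    pool = columns + dummies
    residual = list(y)
    active, dummies_in = [], 0
    while dummies_in < max_dummies and len(active) < len(pool):
        # Assumed selector: pick the inactive column most correlated
        # with the current residual (greedy matching pursuit).
        best = max((j for j in range(len(pool)) if j not in active),
                   key=lambda j: abs(dot(pool[j], residual)))
        coef = dot(pool[best], residual)
        residual = [r - coef * x for r, x in zip(residual, pool[best])]
        active.append(best)
        if best >= p:  # indices >= p are dummies
            dummies_in += 1
    return [j for j in active if j < p]  # original predictors only


def select_variables(x_columns, y, num_experiments=20, num_dummies=None,
                     max_dummies=1, vote_threshold=0.75, seed=0):
    """Fuse the experiments: keep predictors whose relative selection
    frequency exceeds a hand-picked (uncalibrated) voting threshold."""
    rng = random.Random(seed)
    columns = [standardize(c) for c in x_columns]
    p = len(columns)
    num_dummies = p if num_dummies is None else num_dummies
    counts = [0] * p
    for _ in range(num_experiments):
        for j in run_experiment(columns, y, num_dummies, max_dummies, rng):
            counts[j] += 1
    freq = [c / num_experiments for c in counts]
    selected = [j for j in range(p) if freq[j] > vote_threshold]
    return selected, freq


def make_toy_data(n, p, support, seed=0, signal=3.0):
    """Toy linear model: y depends only on the columns listed in `support`."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    y = [sum(signal * cols[j][i] for j in support) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    return cols, y
```

Calling `select_variables(*make_toy_data(n=60, p=20, support=[0, 1, 2]))` returns the indices that pass the vote together with the per-variable selection frequencies. The early stop is what keeps each experiment cheap: the forward pass halts after a handful of inclusions instead of computing a full solution path.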