Model-X knockoffs allow analysts to perform feature selection using almost any machine learning algorithm while still provably controlling the expected proportion of false discoveries. To apply model-X knockoffs, one must construct synthetic variables, called knockoffs, which effectively act as controls during feature selection. The gold standard for constructing knockoffs has been to minimize the mean absolute correlation (MAC) between features and their knockoffs, but, surprisingly, we prove this procedure can be powerless in extremely easy settings, including Gaussian linear models with correlated exchangeable features. The key problem is that minimizing the MAC creates strong joint dependencies between the features and knockoffs, which allow machine learning algorithms to partially or fully reconstruct the effect of the features on the response using the knockoffs. To improve the power of knockoffs, we propose generating knockoffs which minimize the reconstructability (MRC) of the features, and we demonstrate our proposal for Gaussian features by showing it is computationally efficient, robust, and powerful. We also prove that certain MRC knockoffs minimize a natural definition of estimation error in Gaussian linear models. Furthermore, in an extensive set of simulations, we find many settings with correlated features in which MRC knockoffs dramatically outperform MAC-minimizing knockoffs, and no settings in which MAC-minimizing knockoffs outperform MRC knockoffs by more than a very slight margin. We implement our methods and a host of others from the knockoffs literature in a new open-source Python package, knockpy.
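To make the MAC-versus-MRC contrast concrete, the following is a minimal illustrative sketch (not the knockpy implementation, and not guaranteed to match the paper's exact conventions) for exchangeable Gaussian features with covariance Sigma = (1 - rho) I + rho 11^T. It compares the MAC-minimizing choice of the diagonal S-matrix, s = min(1, 2(1 - rho)), with an MVR-style choice obtained by numerically minimizing Tr((2 Sigma - S)^{-1}) + Tr(S^{-1}) over S = s I, and then samples Gaussian knockoffs from the standard conditional distribution. All names and parameter values here are illustrative assumptions.

```python
# Illustrative sketch only: contrasts a MAC-minimizing S-matrix with an
# MVR-style S-matrix for exchangeable Gaussian features, then samples
# Gaussian model-X knockoffs. This is not knockpy's API.
import numpy as np
from scipy.optimize import minimize_scalar

p, rho = 50, 0.4  # assumed dimensions; rho chosen so both choices stay feasible
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

# MAC-minimizing (equicorrelated) choice: S = s * I with s = min(1, 2 * (1 - rho)).
s_mac = min(1.0, 2 * (1 - rho))

# MVR-style choice: minimize Tr((2*Sigma - s*I)^{-1}) + Tr((s*I)^{-1}) over s,
# using the eigenvalues of Sigma: (1 - rho) with multiplicity p - 1,
# and 1 + (p - 1) * rho with multiplicity 1.
def mvr_loss(s):
    return ((p - 1) / (2 * (1 - rho) - s)
            + 1.0 / (2 * (1 + (p - 1) * rho) - s)
            + p / s)

s_mvr = minimize_scalar(
    mvr_loss, bounds=(1e-6, 2 * (1 - rho) - 1e-6), method="bounded"
).x

def sample_gaussian_knockoffs(X, Sigma, s, rng):
    """Sample X_tilde | X when (X, X_tilde) is jointly Gaussian with S = s * I."""
    p = Sigma.shape[1]
    Sigma_inv = np.linalg.inv(Sigma)
    cond_mean = X - s * X @ Sigma_inv                  # (Sigma - S) Sigma^{-1} x
    cond_cov = 2 * s * np.eye(p) - (s ** 2) * Sigma_inv
    cond_cov = (cond_cov + cond_cov.T) / 2             # symmetrize for stability
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal(X.shape) @ L.T

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=200)
Xk_mac = sample_gaussian_knockoffs(X, Sigma, s_mac, rng)
Xk_mvr = sample_gaussian_knockoffs(X, Sigma, s_mvr, rng)
print(f"s_mac = {s_mac:.3f}, s_mvr = {s_mvr:.3f}")
```

Restricting to S = s I is natural here because the equicorrelated covariance is permutation-invariant; the sketch is meant only to show how the two criteria pick different amounts of feature-knockoff decorrelation, not to reproduce the paper's experiments.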