The goal of controlled feature selection is to discover the features a response depends on while limiting the proportion of false discoveries to a predefined level. Recently, multiple methods have been proposed that use deep learning to generate knockoffs for controlled feature selection through the Model-X knockoff framework. We demonstrate, however, that these methods often fail to control the false discovery rate (FDR). There are two reasons for this shortcoming. First, these methods often learn inaccurate models of features. Second, the "swap" property, which is required for knockoffs to be valid, is often not well enforced. We propose a new procedure called FlowSelect that remedies both of these problems. To more accurately model the features, FlowSelect uses normalizing flows, the state-of-the-art method for density estimation. To circumvent the need to enforce the swap property, FlowSelect uses a novel MCMC-based procedure to directly compute p-values for each feature. Asymptotically, FlowSelect controls the FDR exactly. Empirically, FlowSelect controls the FDR well on both synthetic and semi-synthetic benchmarks, whereas competing knockoff-based approaches fail to do so. FlowSelect also demonstrates greater power on these benchmarks. Additionally, using data from a genome-wide association study of soybeans, FlowSelect correctly infers the genetic variants associated with specific soybean traits.
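To make the notion of limiting the proportion of false discoveries concrete, the following minimal sketch shows one common way to turn per-feature p-values into a selected set at a target FDR level. It uses the Benjamini-Hochberg step-up rule, which is an assumption made for illustration: the abstract does not state which selection rule FlowSelect applies to its p-values. The function names `benjamini_hochberg` and `false_discovery_proportion`, and the example p-values, are hypothetical.

```python
def benjamini_hochberg(p_values, fdr_level=0.1):
    """Select feature indices with the Benjamini-Hochberg step-up rule.

    Assumption: BH is used here only as an illustrative selection rule;
    the source text does not confirm it is the rule FlowSelect uses.
    """
    m = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if p_values[idx] <= fdr_level * rank / m:
            k = rank
    return sorted(order[:k])


def false_discovery_proportion(selected, true_features):
    """Fraction of selected features that are not truly relevant."""
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_features)) / len(selected)


# Hypothetical p-values for six features; features 0, 2 and 5 are assumed relevant.
p_values = [0.001, 0.8, 0.012, 0.04, 0.5, 0.003]
selected = benjamini_hochberg(p_values, fdr_level=0.1)
fdp = false_discovery_proportion(selected, true_features={0, 2, 5})
```

The FDR is the expectation of this false discovery proportion over repeated experiments, so a single run can exceed the target level even when the procedure controls the FDR.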