Controlled feature selection aims to discover the features a response depends on while limiting the false discovery rate (FDR) to a predefined level. Recently, multiple deep-learning-based methods have been proposed to perform controlled feature selection through the Model-X knockoff framework. We demonstrate, however, that these methods often fail to control the FDR for two reasons. First, these methods often learn inaccurate models of features. Second, the "swap" property, which is required for knockoffs to be valid, is often not well enforced. We propose a new procedure called FlowSelect that remedies both of these problems. To more accurately model the features, FlowSelect uses normalizing flows, the state-of-the-art method for density estimation. To circumvent the need to enforce the swap property, FlowSelect uses a novel MCMC-based procedure to calculate p-values for each feature directly. Asymptotically, FlowSelect computes valid p-values. Empirically, FlowSelect consistently controls the FDR on both synthetic and semi-synthetic benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also demonstrates greater power on these benchmarks. Additionally, FlowSelect correctly infers the genetic variants associated with specific soybean traits from GWAS data.
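To illustrate how per-feature p-values can be turned into a feature set with a target FDR, the following is a minimal sketch and not the FlowSelect implementation. It makes two assumptions that the abstract does not state: that a p-value is formed by comparing an observed test statistic against statistics computed on null samples (for example, samples drawn by an MCMC procedure), and that selection uses the Benjamini-Hochberg procedure, which is known to control the FDR under independence or positive dependence of the p-values. The function names `empirical_p_value` and `benjamini_hochberg` are hypothetical.

```python
import numpy as np


def empirical_p_value(observed_stat, null_stats):
    """Monte Carlo p-value: share of null statistics at least as large as the observed one.

    Assumption: larger statistics indicate stronger evidence that the
    response depends on the feature.
    """
    null_stats = np.asarray(null_stats, dtype=float)
    # The +1 terms count the observed statistic itself and keep the p-value above zero.
    return (1.0 + np.sum(null_stats >= observed_stat)) / (1.0 + null_stats.size)


def benjamini_hochberg(p_values, fdr_level=0.1):
    """Return the sorted indices of features selected at the given target FDR."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if m == 0:
        return np.array([], dtype=int)
    order = np.argsort(p)                          # feature indices by ascending p-value
    thresholds = fdr_level * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)             # nothing passes its threshold
    k = passed[-1]                                 # largest rank whose p-value passes
    return np.sort(order[: k + 1])                 # select every feature up to that rank


if __name__ == "__main__":
    # Hypothetical p-values for five features; illustrative only.
    p_values = [0.001, 0.8, 0.012, 0.5, 0.03]
    print(benjamini_hochberg(p_values, fdr_level=0.1))
```

With a finite number of null samples, the smallest attainable p-value is 1 / (K + 1) for K samples, so K must be large enough for small p-values to pass the Benjamini-Hochberg thresholds when many features are tested.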