This paper introduces a class of asymptotically most powerful knockoff statistics based on a simple principle: that we should prioritize variables in order of our ability to distinguish them from their knockoffs. Our contribution is threefold. First, we argue that feature statistics should estimate "oracle masked likelihood ratios," which are Neyman-Pearson statistics for discriminating between features and knockoffs using partially observed (masked) data. Second, we introduce the masked likelihood ratio (MLR) statistic, a knockoff statistic that estimates the oracle MLR. We show that MLR statistics are asymptotically average-case optimal, i.e., they maximize the expected number of discoveries made by knockoffs when averaging over a user-specified prior on the unknown parameters. Our optimality result places no explicit restrictions on the problem dimensions or the unknown relationship between the response and covariates; instead, we assume a "local dependence" condition which depends only on simple quantities that can be calculated from the data. Third, in simulations and three real data applications, we show that MLR statistics outperform state-of-the-art feature statistics, including in settings where the prior is highly misspecified. We implement MLR statistics in the open-source Python package knockpy; our implementation is often (although not always) faster than computing a cross-validated lasso.