Conditional randomization tests (CRTs) assess whether a variable $x$ is predictive of another variable $y$, having observed covariates $z$. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a training and a test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: $F(x \mid z)$ and $F(y \mid z)$, where $F(\cdot \mid z)$ is a conditional cumulative distribution function (CDF). These variables are termed "information residuals." We give sufficient conditions under which DIET achieves finite-sample type-1 error control and power greater than the type-1 error rate. We then prove that, when the mutual information between the information residuals is used as the test statistic, DIET yields the most powerful conditionally valid test. Finally, we show that DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
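To make the recipe concrete, the following is a minimal sketch of the DIET idea under simplifying assumptions that are not part of the abstract: the conditional CDFs $F(x \mid z)$ and $F(y \mid z)$ are approximated by linear-Gaussian models, and the dependence between the information residuals is measured with a squared Pearson correlation and a permutation null rather than the mutual-information statistic analyzed in the paper. The helper names `conditional_cdf_values` and `diet_pvalue` are hypothetical, not from the authors' code.

```python
# Minimal sketch of the DIET reduction: test x ⫫ y | z by testing marginal
# independence of the information residuals F(x | z) and F(y | z).
# Assumptions (not from the abstract): linear-Gaussian conditional models and
# a correlation-based permutation test instead of a mutual-information statistic.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)


def conditional_cdf_values(target, z):
    """Fit a linear-Gaussian model target | z and return F(target | z) per sample."""
    design = np.column_stack([np.ones(len(z)), z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    scale = resid.std(ddof=design.shape[1])
    return norm.cdf(resid / scale)            # probability integral transform


def diet_pvalue(x, y, z, n_perm=2000):
    """Permutation p-value for marginal independence of the information residuals."""
    u = conditional_cdf_values(x, z)          # F(x | z)
    v = conditional_cdf_values(y, z)          # F(y | z)
    stat = np.corrcoef(u, v)[0, 1] ** 2       # simple dependence statistic
    null = np.array([
        np.corrcoef(rng.permutation(u), v)[0, 1] ** 2 for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)


# Toy check: x and y depend on each other only through z, so the null holds
# and the p-value should be roughly uniform on [0, 1].
n = 500
z = rng.normal(size=(n, 3))
x = z @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)
y = z @ np.array([0.3, 0.8, -1.0]) + rng.normal(size=n)
print(diet_pvalue(x, y, z))
```

In practice the conditional CDFs would be estimated with flexible distributional models rather than linear-Gaussian regressions; the sketch only illustrates how the conditional independence test decouples into two prediction problems plus one marginal independence test.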