We propose the holdout randomization test (HRT), an approach to feature selection using black-box predictive models. The HRT is a specialized version of the conditional randomization test (CRT; Candes et al., 2018) that uses data splitting for feasible computation. The HRT works with any predictive model and produces a valid $p$-value for each feature. To make the HRT more practical, we propose a set of extensions to maximize power and speed up computation. In simulations, these extensions lead to greater power than a competing knockoffs-based approach, without sacrificing control of the error rate. We apply the HRT to two case studies from the scientific literature where heuristics were originally used to select important features for predictive models. The results illustrate how such heuristics can be misleading relative to principled methods like the HRT. Code is available at https://github.com/tansey/hrt.
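The procedure described in the abstract can be illustrated with a minimal sketch. It is not the code from the linked repository, and every name in it (`fit_ols`, `hrt_pvalues`, `n_draws`, `train_frac`) is hypothetical. Two assumptions are made purely for illustration: ordinary least squares stands in for the black-box predictive model, and the conditional distribution of each feature given the others is approximated by a Gaussian linear model fit on the training split. The sketch splits the data, fits the model once, and then, for each feature, compares the held-out squared-error risk against the risk obtained when that feature is replaced by draws from its estimated conditional distribution.

```python
import numpy as np


def fit_ols(X, y):
    """Fit least squares with an intercept; return a prediction function.

    Stand-in for an arbitrary black-box model (an assumption of this sketch).
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef


def hrt_pvalues(X, y, n_draws=200, train_frac=0.5, seed=0):
    """Sketch of a holdout randomization test; returns one p-value per feature."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Data splitting: the model is fit once on the training half only.
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    tr, te = perm[:n_train], perm[n_train:]
    predict = fit_ols(X[tr], y[tr])

    def risk(Z):
        # Test statistic: mean squared error on the held-out half.
        return np.mean((y[te] - predict(Z)) ** 2)

    t_obs = risk(X[te])
    pvals = np.empty(p)

    for j in range(p):
        others = np.delete(np.arange(p), j)

        # Assumed conditional model: X_j | X_{-j} is Gaussian with a linear mean.
        cond = fit_ols(X[tr][:, others], X[tr][:, j])
        resid_sd = np.std(X[tr][:, j] - cond(X[tr][:, others]))
        mean_te = cond(X[te][:, others])

        X_null = X[te].copy()
        n_le = 0
        for _ in range(n_draws):
            # Replace feature j with a draw from its estimated conditional.
            X_null[:, j] = mean_te + resid_sd * rng.standard_normal(len(te))
            # No refitting is needed: only the held-out risk is recomputed.
            n_le += int(risk(X_null) <= t_obs)

        # One-sided Monte Carlo p-value with the usual +1 correction.
        pvals[j] = (1 + n_le) / (1 + n_draws)

    return pvals
```

Calling `hrt_pvalues(X, y)` on an `(n, p)` feature array and a length-`n` response returns `p` values in `(0, 1]`. Because the fitted model is reused for every null draw, the cost per feature is `n_draws` prediction passes rather than `n_draws` model fits, which is the computational saving that data splitting buys. The p-values are only as trustworthy as the conditional model, so the Gaussian linear choice here should be replaced when it is not plausible for the data at hand.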