Classical false discovery rate (FDR) controlling procedures offer strong and interpretable guarantees but often lack the flexibility to work with complex data. By contrast, machine learning-based classification algorithms perform better on modern datasets but typically come without error-control guarantees. In this paper, we bring the two together by introducing a new adaptive novelty detection procedure with FDR control, called AdaDetect. It extends recent work in the multiple testing literature, notably that of Yang et al. (2021), to the high-dimensional setting. We prove that AdaDetect comes with finite-sample guarantees: it controls the FDR strongly and approximates the oracle in terms of power, with explicit remainder terms that are small under mild conditions. In practice, AdaDetect can be combined with any machine learning-based classifier, which lets the user choose the most relevant classification approach. We illustrate this on classical real-world datasets, for which random forest and neural network classifiers are particularly efficient. The versatility of our method is also shown with an astrophysical application.
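The abstract does not spell out the detection step, so the following is only a minimal sketch of one common way to obtain FDR control from arbitrary classifier scores, not the authors' implementation. It assumes that higher scores indicate novelty, that a held-out sample of null (non-novel) scores is available, and that these scores are exchangeable with the scores of the null test points. Under those assumptions, each test score is turned into an empirical p-value and the Benjamini-Hochberg procedure is applied at level `alpha`. The function names `empirical_pvalues` and `benjamini_hochberg` are hypothetical.

```python
import numpy as np


def empirical_pvalues(null_scores, test_scores):
    """Empirical p-values: p_i = (1 + #{null scores >= test score i}) / (n + 1).

    Assumption: larger scores mean "more novel", and null_scores come from
    held-out observations known to be non-novel.
    """
    null_sorted = np.sort(np.asarray(null_scores, dtype=float))
    test_scores = np.asarray(test_scores, dtype=float)
    n = len(null_sorted)
    # searchsorted(side="left") gives the number of null scores strictly
    # below each test score, so n minus that is the number >= the test score.
    n_ge = n - np.searchsorted(null_sorted, test_scores, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def benjamini_hochberg(pvalues, alpha):
    """Boolean mask of rejections from the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Compare the k-th smallest p-value with alpha * k / m.
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing the threshold
        rejected[order[: k + 1]] = True
    return rejected


# Toy usage with made-up scores (in practice they would come from a classifier).
null_scores = np.arange(100)          # held-out null scores 0..99
test_scores = [200, 150, 50, 10]      # two clear outliers, two typical points
pvals = empirical_pvalues(null_scores, test_scores)
detections = benjamini_hochberg(pvals, alpha=0.1)  # flags the first two points
```

The scores themselves can come from any model, for example a random forest or a neural network, which is where the choice of classifier mentioned in the abstract would enter.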