This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are both valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Furthermore, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by numerical experiments on real and simulated data.
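To make the pipeline the abstract describes concrete, below is a minimal sketch of marginal split-conformal p-values combined with the Benjamini-Hochberg procedure. The one-dimensional Gaussian scores, the sample sizes, and the convention that larger scores indicate outliers are illustrative assumptions, not the paper's specific construction; in particular, these are the marginally valid but mutually dependent p-values of the first part of the abstract, not the conditionally valid ones introduced later.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    # Marginal split-conformal p-value for each test point:
    # (1 + #{calibration scores >= test score}) / (n_cal + 1).
    # If a test point is exchangeable with the calibration inliers,
    # its p-value is marginally super-uniform. All test points share
    # the same calibration set, so the p-values are mutually dependent.
    cal_scores = np.asarray(cal_scores)
    n = len(cal_scores)
    return np.array([(1.0 + np.sum(cal_scores >= s)) / (n + 1.0)
                     for s in np.asarray(test_scores)])

def benjamini_hochberg(pvals, alpha=0.1):
    # Standard Benjamini-Hochberg step-up rule: reject the k smallest
    # p-values, where k is the largest index with p_(k) <= alpha * k / m.
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    k = np.flatnonzero(passed).max() + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# Toy example (assumed data): calibration inliers ~ N(0, 1); the test
# batch mixes inliers with a few mean-shifted outliers.
rng = np.random.default_rng(0)
cal = rng.normal(size=1000)
test = np.concatenate([rng.normal(size=90), rng.normal(3.0, 1.0, size=10)])
pvals = conformal_pvalues(cal, test)
print("flagged as outliers:", np.flatnonzero(benjamini_hochberg(pvals, alpha=0.1)))
```

The dependence induced by the shared calibration set is exactly what makes the FDR analysis of this combination nontrivial, and it is the positive dependence property proved in the paper that licenses applying Benjamini-Hochberg here.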