We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition of OOD that accounts for both the input distribution and the learning algorithm, which yields insights for the construction of powerful tests for OOD detection. We propose a procedure, inspired by multiple hypothesis testing, that systematically combines any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings but not uniformly well across different types of OOD instances. In contrast, our proposed method, which combines multiple statistics, performs uniformly well across different datasets and neural networks.
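The abstract names two ingredients, conformal p-values and a multiple-testing-style combination, without spelling out the details. The following is a minimal sketch of how such a procedure might look; the score functions, the use of held-out in-distribution calibration data, and the Benjamini-Hochberg/Simes-style combination rule are all illustrative assumptions, not specifics taken from the abstract.

```python
import numpy as np

def conformal_pvalue(cal_scores, test_score):
    """Conformal p-value of a test score relative to calibration scores.

    cal_scores: a statistic evaluated on held-out in-distribution
    (calibration) inputs; test_score: the same statistic on the test
    input. Convention (an assumption here): larger scores look "more
    OOD". Under exchangeability, the returned p-value is super-uniform
    for in-distribution inputs, which is what yields guarantees on the
    probability of falsely flagging an in-distribution sample.
    """
    cal_scores = np.asarray(cal_scores)
    n = len(cal_scores)
    return (1.0 + np.sum(cal_scores >= test_score)) / (n + 1.0)

def combine_and_test(pvalues, alpha=0.05):
    """Flag OOD by combining conformal p-values from several statistics.

    Uses a Benjamini-Hochberg / Simes-style rule for the global null
    ("the input is in-distribution"): reject if any ordered p-value
    p_(i) falls at or below i * alpha / k. The choice of combination
    rule is an illustrative assumption.
    """
    p = np.sort(np.asarray(pvalues))
    k = len(p)
    thresholds = alpha * np.arange(1, k + 1) / k
    return bool(np.any(p <= thresholds))

# Hypothetical usage with two statistics from a trained model, e.g. a
# negative max-softmax score and a feature-space distance (both assumed):
# p1 = conformal_pvalue(cal_neg_softmax, test_neg_softmax)
# p2 = conformal_pvalue(cal_feat_dist, test_feat_dist)
# is_ood = combine_and_test([p1, p2], alpha=0.05)
```

In this sketch, the per-statistic conformal p-values carry the validity guarantee for in-distribution inputs, while the combination step is what allows the test to remain powerful across different types of OOD instances rather than only in the settings where a single threshold-based statistic happens to work well.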