Out-of-distribution (OOD) detection methods assume access to test ground truths, i.e., labels indicating whether individual test samples are in-distribution (IND) or OOD. In the real world, however, such ground truths are not always available, so we cannot know which samples are detected correctly and cannot compute metrics like AUROC to evaluate the performance of different OOD detection methods. In this paper, we are the first to introduce the unsupervised evaluation problem in OOD detection, which aims to evaluate OOD detection methods in real-world changing environments without OOD labels. We propose three methods to compute Gscore as an unsupervised indicator of OOD detection performance. We further introduce a new benchmark, Gbench, which contains 200 real-world OOD datasets with various label spaces for training and evaluating our method. Through experiments, we find a strong quantitative correlation between Gscore and OOD detection performance. Extensive experiments demonstrate that our Gscore achieves state-of-the-art performance. Gscore also generalizes well across different IND/OOD datasets, OOD detection methods, backbones, and dataset sizes. We further provide interesting analyses of the effects of backbones and IND/OOD datasets on OOD detection performance. The data and code will be available.
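To illustrate why unsupervised evaluation is needed, the sketch below shows that the standard AUROC metric cannot be computed without per-sample IND/OOD ground truths: it is the rank statistic measuring how often an OOD sample receives a higher detector score than an IND sample. The scores and labels here are illustrative, not from the paper.

```python
def auroc(scores, is_ood):
    """AUROC via the Mann-Whitney statistic: the probability that a
    randomly chosen OOD sample gets a higher OOD score than a randomly
    chosen IND sample (ties count half). Requires ground-truth labels."""
    ood = [s for s, o in zip(scores, is_ood) if o]
    ind = [s for s, o in zip(scores, is_ood) if not o]
    wins = sum((o > i) + 0.5 * (o == i) for o in ood for i in ind)
    return wins / (len(ood) * len(ind))

# Hypothetical detector scores; the is_ood flags are exactly the
# ground truths that are unavailable in real-world deployment.
scores = [0.9, 0.8, 0.3, 0.2, 0.7]
is_ood = [True, True, False, False, True]
print(auroc(scores, is_ood))  # 1.0: every OOD score exceeds every IND score
```

Because the label vector is indispensable in this computation, deployed systems need a label-free proxy such as Gscore to rank detection methods.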