We reconsider the evaluation of out-of-distribution (OOD) detection methods for image recognition. Although many studies have sought to build better OOD detection methods, most of them follow Hendrycks and Gimpel's work for their experimental evaluation protocol. While a unified evaluation protocol is necessary for fair comparison, it is unclear whether its choice of tasks and datasets reflects real-world applications and whether the evaluation results generalize to other OOD detection scenarios. In this paper, we experimentally evaluate the performance of representative OOD detection methods in three scenarios, namely irrelevant input detection, novel class detection, and domain shift detection, on various datasets and classification tasks. The results show that differences in scenarios and datasets alter the relative performance of the methods. Our results can also serve as a guide for practitioners selecting OOD detection methods.