Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of OOD examples and how to best detect them. We categorize these examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weak spot for current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.
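The two detection families compared above, calibration-based scoring and density estimation, can be sketched as simple scoring functions. This is a hypothetical minimal sketch, not the paper's implementation: `msp_score` stands in for a calibration method (maximum softmax probability over a classifier's logits), `density_score` for a density method (average per-token log-likelihood under a language model, here assumed precomputed), and the threshold is chosen on held-out in-distribution data.

```python
import math

def msp_score(logits):
    """Calibration-style score: maximum softmax probability (MSP)
    over a classifier's output logits. Higher => more in-distribution.
    (Illustrative sketch; not the paper's exact implementation.)"""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

def density_score(token_logprobs):
    """Density-style score: average per-token log-likelihood of the
    input under a language model (log-probs assumed precomputed).
    Higher => more in-distribution."""
    return sum(token_logprobs) / len(token_logprobs)

def is_ood(score, threshold):
    """Flag an example as OOD when its score falls below a threshold
    tuned on held-out in-distribution data (hypothetical helper)."""
    return score < threshold
```

In this framing, both methods reduce to "score, then threshold"; the paper's finding is that which score ranks OOD examples correctly depends on whether the shift is background or semantic.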