Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding, so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called the contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor of a given unlabeled dataset. We leverage the outputs of several anomaly detectors as a representation that already captures the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically, on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the anomaly detectors' performance over several alternative methods. All code is publicly available for full reproducibility.
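To make the thresholding step concrete, the following is a minimal sketch of how a contamination factor turns anomaly scores into predictions. It is not the authors' estimator: the function name `threshold_scores`, the toy scores, and the Beta-distributed `gamma_samples` standing in for posterior samples of the contamination factor are all assumptions made for illustration.

```python
import numpy as np


def threshold_scores(scores, contamination):
    """Flag roughly the top `contamination` fraction of scores as anomalies.

    Hypothetical helper for illustration; higher scores are assumed to
    mean "more anomalous".
    """
    scores = np.asarray(scores, dtype=float)
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must lie strictly between 0 and 1")
    # Place the threshold at the (1 - contamination) quantile of the scores.
    threshold = np.quantile(scores, 1.0 - contamination)
    # Examples scoring strictly above the threshold are marked as anomalies.
    predictions = scores > threshold
    return threshold, predictions


# Toy anomaly scores (assumed data, standing in for a detector's output).
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)

# Stand-in for posterior samples of the contamination factor. A Beta
# distribution is used purely as a placeholder for whatever posterior
# the estimation method would produce.
gamma_samples = rng.beta(5, 95, size=2000)

# Use the posterior mean as the point estimate for thresholding.
gamma_hat = gamma_samples.mean()
threshold, predictions = threshold_scores(scores, gamma_hat)
```

With continuous scores, the fraction of flagged examples in `predictions` matches `gamma_hat` up to rounding; heavily tied scores can make the flagged fraction deviate from the target.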