Anomaly detection is a widely studied domain in machine learning. Many models have been proposed in the literature and are compared through different metrics measured on various datasets. The most popular metrics used to compare performance are the F1-score, AUC, and AVPR. In this paper, we show that the F1-score and AVPR are highly sensitive to the contamination rate. One consequence is that their values can be artificially increased by modifying the train-test split procedure. This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not well detailed. Moreover, we show that the F1-score and AVPR cannot be used to compare performance across different datasets, as they do not reflect the intrinsic difficulty of modeling such data. Based on these observations, we argue that the F1-score and AVPR should not be used as metrics for anomaly detection. We recommend a generic evaluation procedure for unsupervised anomaly detection, including the use of other metrics such as the AUC, which are more robust to arbitrary choices in the evaluation protocol.
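The sensitivity described above can be illustrated with a minimal simulation sketch (not the paper's exact protocol): a fixed anomaly scorer is evaluated on test sets whose contamination rate varies, as would happen under different train-test split choices. The helper `make_test_set`, the score distributions, and the fixed `threshold` below are illustrative assumptions; the point is only that F1 and AVPR shift with the contamination rate while ROC AUC stays roughly constant.

```python
# Illustrative sketch: the same scorer, evaluated on test sets with different
# contamination rates. F1 and AVPR (average precision) track the contamination
# rate; ROC AUC remains roughly stable because it is rank-based.
import numpy as np
from sklearn.metrics import f1_score, average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def make_test_set(n, contamination):
    """Synthetic anomaly scores: normals ~ N(0, 1), anomalies ~ N(2, 1)."""
    n_anom = int(n * contamination)
    y = np.concatenate([np.zeros(n - n_anom), np.ones(n_anom)])
    scores = np.concatenate([rng.normal(0.0, 1.0, n - n_anom),
                             rng.normal(2.0, 1.0, n_anom)])
    return y, scores

threshold = 1.0  # fixed decision threshold used only for the F1 computation
for contamination in [0.01, 0.05, 0.10, 0.20, 0.40]:
    y, scores = make_test_set(20_000, contamination)
    y_pred = (scores > threshold).astype(int)
    print(f"contamination={contamination:.2f}  "
          f"F1={f1_score(y, y_pred):.3f}  "
          f"AVPR={average_precision_score(y, scores):.3f}  "
          f"AUC={roc_auc_score(y, scores):.3f}")
```

Running this sketch, the F1 and AVPR values grow as the test set contains more anomalies even though the underlying detector is unchanged, whereas the AUC hovers around the same value; this is the mechanism by which split choices can inflate reported scores.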