In recent years, specific evaluation metrics for time series anomaly detection algorithms have been developed to address the limitations of classical precision and recall. However, such metrics are built heuristically as an aggregate of multiple desirable aspects, introduce parameters, and wipe out the interpretability of the output. In this article, we first highlight the limitations of classical precision/recall, as well as the main issues of recent event-based metrics; for instance, we show that an adversarial algorithm can reach high precision and recall on almost any dataset under weak assumptions. To cope with these problems, we propose a theoretically grounded, robust, parameter-free and interpretable extension of the precision/recall metrics, based on the concept of "affiliation" between the ground truth and the prediction sets. Our metrics leverage measures of duration between ground truth and predictions, and thus have an intuitive interpretation. By further comparison against random sampling, we obtain a normalized precision/recall, quantifying how much a given set of results is better than a random baseline prediction. By construction, our approach keeps the evaluation local with respect to ground truth events, enabling fine-grained visualization and interpretation of algorithmic results. We evaluate our proposal across various public time series anomaly detection datasets, algorithms and metrics. We further derive theoretical properties of the affiliation metrics that give explicit expectations about their behavior and ensure robustness against adversarial strategies.
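As a minimal illustration of the limitation of classical precision/recall mentioned above, the following Python sketch contrasts point-wise scores with a simple time-distance measure. It is not the affiliation metric itself: the helper names (`pointwise_precision_recall`, `mean_distance_to_set`), the sample indices, and the use of a plain average distance to the nearest ground-truth sample are all assumptions made for illustration only.

```python
def pointwise_precision_recall(truth, pred):
    """Classical point-wise precision and recall over sample indices."""
    truth, pred = set(truth), set(pred)
    tp = len(truth & pred)  # samples flagged in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


def mean_distance_to_set(points, reference):
    """Average time distance from each point to its nearest reference sample.

    Hypothetical, simplified stand-in for a duration-based score;
    it is NOT the definition used by the affiliation metrics.
    """
    return sum(min(abs(t - r) for r in reference) for t in points) / len(points)


# Made-up sample indices, for illustration only.
truth = [10, 11, 12]      # ground-truth anomalous samples
near_miss = [13, 14, 15]  # prediction right after the event
far_miss = [90, 91, 92]   # prediction far away from the event

# Point-wise scores cannot tell the two predictions apart.
print(pointwise_precision_recall(truth, near_miss))  # (0.0, 0.0)
print(pointwise_precision_recall(truth, far_miss))   # (0.0, 0.0)

# A distance-based view separates them.
print(mean_distance_to_set(near_miss, truth))  # 2.0
print(mean_distance_to_set(far_miss, truth))   # 79.0
```

Both predictions receive zero point-wise precision and recall, although one sits immediately after the event and the other is far from it; a measure based on durations between predictions and ground truth distinguishes the two cases.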