In recent years, studies on time-series anomaly detection (TAD) have reported high F1 scores on benchmark TAD datasets, giving the impression of clear improvements. However, most of these studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. In this paper, we show theoretically and experimentally that the PA protocol is highly likely to overestimate detection performance; that is, even a random anomaly score can easily appear to be a state-of-the-art TAD method. Therefore, comparing TAD methods by their F1 scores after PA can lead to misguided rankings. Furthermore, we question the potential of existing TAD methods by showing that an untrained model achieves detection performance comparable to that of existing methods even without PA. Based on our findings, we propose a new baseline and an evaluation protocol. We expect that our study will support rigorous evaluation of TAD and lead to further improvement in future research.
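To make the concern about PA concrete, the following is a minimal sketch, assuming the commonly described form of point adjustment: if at least one point inside a contiguous ground-truth anomaly segment is flagged, every point in that segment is counted as flagged. The function names `point_adjust` and `f1_score`, the synthetic labels, and the 2% random flag rate are all illustrative assumptions; none of them come from the paper, and the toy data is not one of its benchmarks.

```python
import random


def point_adjust(pred, label):
    """Return a copy of pred where every ground-truth anomaly segment
    containing at least one flagged point is flagged in full."""
    adjusted = list(pred)
    n = len(label)
    i = 0
    while i < n:
        if label[i] == 1:
            # Find the end (exclusive) of this contiguous anomaly segment.
            j = i
            while j < n and label[j] == 1:
                j += 1
            if any(pred[i:j]):
                for k in range(i, j):
                    adjusted[k] = 1
            i = j
        else:
            i += 1
    return adjusted


def f1_score(pred, label):
    """Point-wise F1 for binary sequences; returns 0.0 when there are no true positives."""
    tp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, label) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic labels (an assumption for illustration): 1000 points
# with five anomaly segments of 50 points each.
label = [0] * 1000
for start in (100, 300, 500, 700, 900):
    for t in range(start, start + 50):
        label[t] = 1

# A "detector" that flags about 2% of points uniformly at random.
rng = random.Random(0)
random_pred = [1 if rng.random() < 0.02 else 0 for _ in label]

f1_raw = f1_score(random_pred, label)
f1_pa = f1_score(point_adjust(random_pred, label), label)
print(f"F1 without PA: {f1_raw:.3f}")
print(f"F1 with PA:    {f1_pa:.3f}")
```

Because PA only converts false negatives into true positives and leaves false positives unchanged, the adjusted F1 can never be lower than the raw F1. With long anomaly segments, a single lucky hit per segment is enough to credit the whole segment, which is the mechanism by which a random score can look competitive.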