Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining considerable attention, the lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed from real data traces collected during repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each anomaly instance, ground truth labels are provided for both the root cause interval and the extended effect interval, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon's dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.