Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series

Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining considerable attention, the lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as those for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon's dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.

Related content

Anomaly detection

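In data mining, anomaly detection is the identification of items, events, or observations that do not conform to an expected pattern or to the other items in a dataset. Anomalous items typically translate into some kind of problem, such as bank fraud, a structural defect, a medical problem, or an error in a text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects but unexpected bursts of activity. This pattern does not fit the common statistical definition of an outlier as a rare object, so many anomaly detection methods (unsupervised methods in particular) will fail on such data unless it has been aggregated appropriately. A cluster analysis algorithm, by contrast, may be able to detect the micro-clusters formed by these patterns.

There are three broad categories of anomaly detection methods. [1] Unsupervised anomaly detection methods detect anomalies in unlabeled test data, under the assumption that the majority of instances in the dataset are normal, by looking for the instances that fit least well with the remainder of the data. Supervised anomaly detection methods require a dataset that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of anomaly detection). Semi-supervised anomaly detection methods build a model representing normal behavior from a given normal training dataset, and then test the likelihood that a test instance was generated by the learned model.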
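The semi-supervised category described above can be illustrated with a minimal sketch. It is not taken from Exathlon or from any of the methods evaluated in the paper: it assumes a single univariate metric, models normal behavior as a Gaussian fitted to normal-only training data, and flags test points whose likelihood under that model is lower than that of a point three standard deviations from the mean. The function names, the threshold, and the sample values are all hypothetical.

```python
import math
import statistics


def fit_normal_model(train_values):
    """Estimate mean and sample standard deviation from normal-only data."""
    mean = statistics.fmean(train_values)
    std = statistics.stdev(train_values)
    return mean, std


def log_likelihood(value, mean, std):
    """Log-density of `value` under a univariate Gaussian N(mean, std^2)."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def detect(test_values, mean, std, z_threshold=3.0):
    """Flag points less likely than a point `z_threshold` std devs from the mean.

    The threshold of 3.0 is an illustrative assumption, not a recommended value.
    """
    cutoff = log_likelihood(mean + z_threshold * std, mean, std)
    return [log_likelihood(v, mean, std) < cutoff for v in test_values]


# Hypothetical readings of one metric; the training set contains normal data only.
normal_train = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
test = [10.1, 9.9, 14.5, 10.0]

mean, std = fit_normal_model(normal_train)
flags = detect(test, mean, std)
print(flags)
```

Only the third test reading is flagged here, because it lies far outside the range seen in training. High-dimensional time series such as those in Exathlon would call for a multivariate, time-aware model of normal behavior; the one-dimensional Gaussian is used here only to keep the idea visible.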
