Research on anomaly detection generally proposes algorithms or end-to-end systems designed to automatically discover outliers in a dataset or a stream. While the literature abounds with algorithms and with metric definitions for better evaluation, the quality of the ground truth against which these algorithms are evaluated is seldom questioned. In this paper, we present a systematic analysis of the available public ground truth (and, additionally, our own private ground truth) for anomaly detection in network environments, where data is intrinsically temporal and multivariate and, in particular, exhibits spatial properties, which, to the best of our knowledge, we are the first to explore. Our analysis reveals that, while anomalies are by definition temporally rare events, their spatial characterization clearly shows that some types of anomalies are significantly more common than others. We find that simple clustering can reduce the need for human labeling by a factor of 2x to 10x, a reduction that we are the first to quantitatively analyze in the wild.
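To make the labeling-reduction claim concrete, the following is a minimal sketch and not the method used in the paper, whose clustering algorithm is not specified here. It assumes that each anomaly can be summarized by the set of features it affects (one possible reading of "spatial", not confirmed by the text), and it uses a greedy leader clustering with Jaccard similarity as a stand-in technique. The feature names, the threshold, and the function names are all hypothetical.

```python
from typing import FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two sets of affected features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_clustering(signatures: List[FrozenSet[str]],
                      threshold: float = 0.5) -> List[int]:
    """Greedy clustering: each anomaly joins the first cluster whose
    leader is at least `threshold` similar, else it starts a new cluster."""
    leaders: List[FrozenSet[str]] = []
    assignment: List[int] = []
    for sig in signatures:
        for idx, leader in enumerate(leaders):
            if jaccard(sig, leader) >= threshold:
                assignment.append(idx)
                break
        else:
            # No existing leader is similar enough: open a new cluster.
            leaders.append(sig)
            assignment.append(len(leaders) - 1)
    return assignment


def labeling_reduction(assignment: List[int]) -> float:
    """Anomalies per cluster: the saving if an expert labels one
    representative per cluster instead of every anomaly."""
    return len(assignment) / len(set(assignment))


# Hypothetical anomalies, each described by the features it affects.
anomalies = [
    frozenset({"cpu", "mem"}),
    frozenset({"cpu", "mem", "temp"}),
    frozenset({"if_in_errors"}),
    frozenset({"cpu", "mem"}),
    frozenset({"if_in_errors", "if_out_errors"}),
    frozenset({"bgp_flaps"}),
]

assignment = leader_clustering(anomalies)
print(assignment)                      # cluster index per anomaly
print(labeling_reduction(assignment))  # labeling reduction factor
```

In this toy input, six anomalies fall into three clusters, so labeling one representative per cluster would halve the labeling effort. The reduction obtained on real data depends on how skewed the distribution of anomaly types is and on the chosen similarity threshold.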