Detecting anomalies in real-world datasets remains a challenging task. Data annotation is labor-intensive, particularly for sequential datasets, where the start and end times of anomalies are not known. As a result, data collected from sequential real-world processes can be largely unlabeled or contain inaccurate labels. These characteristics challenge the application of anomaly detection techniques based on supervised learning. In contrast, Multiple Instance Learning (MIL) has been shown to be effective on problems with incomplete knowledge of labels in the training dataset, mainly due to the notion of bags, in which labels are attached to groups of instances rather than to individual instances. While largely under-leveraged for anomaly detection, MIL provides an appealing formulation for anomaly detection over real-world datasets, and developing this formulation is the primary contribution of this paper. We propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions for key components of the framework. We evaluate the resulting algorithms on four datasets that capture different physical processes along different modalities. The experimental evaluation yields several observations: the MIL-based formulation performs no worse than single-instance learning on easy to moderate datasets and outperforms it on more challenging datasets. Altogether, the results show that the framework generalizes well across diverse datasets from different real-world application domains.
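To make the notion of bags concrete, the following minimal sketch groups a sequence into fixed-size bags and derives bag-level labels and scores. It assumes the standard MIL convention (a bag is positive if at least one of its instances is positive) and max-pooling of instance scores; the abstract does not specify the paper's actual design decisions, so the function names `make_bags`, `bag_label`, and `bag_score`, the fixed bag size, and the pooling choice are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Sequence


def make_bags(sequence: Sequence[float], bag_size: int) -> List[List[float]]:
    """Split a sequence into consecutive, non-overlapping bags.

    Hypothetical helper: fixed-size windows are an assumption made for
    illustration. The last bag may be shorter than bag_size.
    """
    if bag_size <= 0:
        raise ValueError("bag_size must be positive")
    return [list(sequence[i:i + bag_size])
            for i in range(0, len(sequence), bag_size)]


def bag_label(instance_labels: Sequence[float]) -> int:
    """Standard MIL assumption: a bag is anomalous (1) if any instance is."""
    return int(any(instance_labels))


def bag_score(instance_scores: Sequence[float]) -> float:
    """Max-pooling (assumed): a bag is as anomalous as its worst instance."""
    return max(instance_scores)


# Per-time-step anomaly scores and (normally unknown) instance labels.
scores = [0.1, 0.2, 0.9, 0.1, 0.0, 0.3]
labels = [0, 0, 1, 0, 0, 0]

score_bags = make_bags(scores, bag_size=3)
label_bags = make_bags(labels, bag_size=3)

bag_scores = [bag_score(b) for b in score_bags]
bag_labels = [bag_label(b) for b in label_bags]
```

Under these assumptions, an annotator only needs to state whether an anomaly occurred somewhere within each window, not its exact start and end times, which is the kind of incomplete labeling the abstract describes.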