Distance-based outlier detection is widely adopted in many fields, e.g., data mining and machine learning, because it is unsupervised, can be employed in a generic metric space, and does not have any assumptions of data distributions. Data mining and machine learning applications face a challenge of dealing with large datasets, which requires efficient distance-based outlier detection algorithms. Due to the popularization of computational environments with large memory, it is possible to build a main-memory index and detect outliers based on it, which is a promising solution for fast distance-based outlier detection. Motivated by this observation, we propose a novel approach that exploits a proximity graph. Our approach can employ an arbitrary proximity graph and obtains a significant speed-up against state-of-the-art. However, designing an effective proximity graph raises a challenge, because existing proximity graphs do not consider efficient traversal for distance-based outlier detection. To overcome this challenge, we propose a novel proximity graph, MRPG. Our empirical study using real datasets demonstrates that MRPG detects outliers significantly faster than the state-of-the-art algorithms.
翻译:在许多领域,例如数据挖掘和机器学习,由于数据挖掘和机器学习是不受监督的,因此广泛采用远程外部探测法,因为数据挖掘和机器学习可以用于通用的计量空间,而且没有数据分布的任何假设。数据挖掘和机器学习应用面临着处理大型数据集的挑战,这需要高效的远程外部探测算法。由于利用大量内存的计算环境普及,因此有可能建立主要模拟索引,并在此基础上探测外源,这是快速远程外源探测的一个有希望的解决方案。受这一观察的启发,我们提出了一个利用近距离图的新办法。我们的方法可以使用任意的近距离图,对最新数据进行大幅度的加速。然而,设计有效的近端图提出了挑战,因为现有的近距离图并不认为远距离外源探测有高效的跨度。为了克服这一挑战,我们提出了一个新的近距离图,即MRPG。我们使用实际数据集进行的经验研究显示,MPGS的外源比状态算法要快得多。