Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include, e.g., process mining and anomaly detection. It comes with two user parameters ({\epsilon}, MinPts) that determine the clustering result, but are typically unknown in advance. Thus, users need to interactively test various settings until satisfying clusterings are found. However, existing solutions suffer from the following limitations: (a) Ineffective pruning of expensive neighborhood computations. (b) Approximate clustering, where objects are falsely labeled noise. (c) Restricted parameter tuning that is limited to {\epsilon} whereas MinPts is constant, which reduces the explorable clusterings. (d) Inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexities of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is effcient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude.
翻译:密度聚类旨在在给定数据集中找到相似对象(即聚类)。 应用程序包括流程挖掘和异常检测。它需要两个用户参数({\epsilon},MinPts),这些参数确定聚类结果,但通常事先不知道。因此,用户需要交互式测试各种设置,直到找到令人满意的聚类。但是,现有解决方案存在以下限制:(a)无效修剪昂贵的邻域计算。(b)近似聚类,其中对象错误地标记为噪音。(c)参数调整受到限制,该限制仅限于{\epsilon},而MinPts是恒定的,这降低了可探索聚类。(d)应用程序数据类型和距离函数的刚性。我们提出了FINEX,一种线性空间索引,可克服这些限制。我们的索引提供精确的聚类结果,并且可以使用两个参数之一进行查询。FINEX尽可能避免邻域计算,并通过利用基本的密度聚类特性减少了剩余计算的复杂性。因此,我们的解决方案是高效且灵活的,关于数据类型和距离函数。此外,FINEX遵守原始而简单的密度聚类概念。在我们对来自各个领域的12个大型实际数据集的实验中,FINEX经常比各种项法技术优越数个数量级。