Clustering has many important applications in computer science, but real-world datasets often contain outliers. The presence of outliers can make the clustering problems to be much more challenging. In this paper, we propose a framework for solving three representative center-based clustering with outliers problems: $k$-center/median/means clustering with outliers. The framework actually is very simple, where we just need to take a small uniform sample from the input and run an existing approximation algorithm on the sample. However, our analysis is fundamentally different from the previous (uniform and non-uniform) sampling based ideas. To explain the effectiveness of uniform sampling in theory, we introduce a "significance" criterion and prove that the performance of our framework depends on the significance degree of the given instance. In particular, the sample size can be independent of the input data size $n$ and the dimensionality $d$, if we assume the given instance is sufficiently "significant", which is in fact a fairly appropriate assumption in practice. Due to its simplicity, the uniform sampling approach also enjoys several significant advantages over the non-uniform sampling approaches. The experiments suggest that our framework can achieve comparable clustering results with existing methods, but is much easier to implement and can greatly reduce the running times. To the best of our knowledge, this is the first work that systematically studies the effectiveness of uniform sampling from both theoretical and experimental aspects.
翻译:集群在计算机科学中有许多重要的应用, 但真实世界的数据集往往包含外源。 外源的存在可以使群集问题更具挑战性。 在本文中, 我们提出了一个解决三个有外源问题的代表性中心群的框架: $k$- center/ midn/ means croup with autliers。 这个框架其实非常简单, 我们只需要从输入中提取一个小型的统一样本, 并在抽样中运行一个现有的近似算法。 但是, 我们的分析与以前( 统一和非统一) 基于抽样的构想有根本的不同。 为了在理论上解释统一取样的有效性, 我们引入了一个“ 质性” 标准, 并证明我们框架的绩效取决于特定实例的重要性。 特别是, 抽样规模可以独立于输入数据大小 $n 美元 和 美元 的维度, 如果我们假设给定的例子足够“ 重大 ”, 而在实践上是一个相当恰当的假设。 由于它的简单性, 统一取样方法在非统一取样方法上也享有一些重大的好处。 我们的框架能够有系统化地进行这样的实验性研究, 。