带集合异常点的聚类及其在关系型聚类中的应用 (Clustering with Set Outliers and Applications in Relational Clustering)

We introduce and study the $k$-center clustering problem with set outliers, a natural and practical generalization of the classical $k$-center clustering with outliers. Instead of removing individual data points, our model allows discarding up to $z$ subsets from a given family of candidate outlier sets $\mathcal{H}$. Given a metric space $(P,\mathsf{dist})$, where $P$ is a set of elements and $\mathsf{dist}$ a distance metric, a family of sets $\mathcal{H}\subseteq 2^P$, and parameters $k, z$, the goal is to compute a set of $k$ centers $S\subseteq P$ and a family of $z$ sets $H\subseteq \mathcal{H}$ to minimize $\max_{p\in P\setminus(\bigcup_{h\in H} h)} \min_{s\in S}\mathsf{dist}(p,s)$. This abstraction captures structured noise common in database applications, such as faulty data sources or corrupted records in data integration and sensor systems. We present the first approximation algorithms for this problem in both general and geometric settings. Our methods provide tri-criteria approximations: selecting up to $2k$ centers and $2f z$ outlier sets (where $f$ is the maximum number of sets that a point belongs to), while achieving $O(1)$-approximation in clustering cost. In geometric settings, we leverage range and BBD trees to achieve near-linear time algorithms. In many real applications $f=1$. In this case we further improve the running time of our algorithms by constructing small \emph{coresets}. We also provide a hardness result for the general problem showing that it is unlikely to get any sublinear approximation on the clustering cost selecting less than $f\cdot z$ outlier sets. We demonstrate that this model naturally captures relational clustering with outliers: outliers are input tuples whose removal affects the join output. We provide approximation algorithms for both, establishing a tight connection between robust clustering and relational query evaluation.

翻译：本文提出并研究了带集合异常点的k中心聚类问题，这是经典带异常点的k中心聚类问题的自然且实用的推广。与移除单个数据点不同，本模型允许从给定的候选异常点集合族$\mathcal{H}$中丢弃最多$z$个子集。给定度量空间$(P,\mathsf{dist})$，其中$P$是元素集合，$\mathsf{dist}$是距离度量，集合族$\mathcal{H}\subseteq 2^P$，以及参数$k, z$，目标是计算一个包含$k$个中心的集合$S\subseteq P$和一个包含$z$个集合的族$H\subseteq \mathcal{H}$，以最小化$\max_{p\in P\setminus(\bigcup_{h\in H} h)} \min_{s\in S}\mathsf{dist}(p,s)$。该抽象模型捕捉了数据库应用中常见的结构化噪声，例如数据集成和传感器系统中的故障数据源或损坏记录。我们针对该问题在一般场景和几何场景下首次提出了近似算法。我们的方法提供了三准则近似：选择最多$2k$个中心和$2f z$个异常点集合（其中$f$是一个点所属的最大集合数），同时实现聚类成本的$O(1)$近似。在几何场景中，我们利用范围树和BBD树实现了近线性时间算法。在许多实际应用中$f=1$。在此情况下，我们通过构建小型\emph{核心集}进一步提升了算法的运行时间。我们还为一般问题提供了一个硬度结果，表明在选择少于$f\cdot z$个异常点集合的情况下，不太可能获得聚类成本的任何次线性近似。我们证明了该模型自然地捕捉了带异常点的关系型聚类：异常点是那些移除后会影响连接输出的输入元组。我们为两者都提供了近似算法，从而在鲁棒聚类和关系型查询评估之间建立了紧密联系。