We study the problem of constructing coresets for $(k, z)$-clustering when the input dataset is corrupted by stochastic noise drawn from a known distribution. In this setting, evaluating the quality of a coreset is inherently challenging, since the true underlying dataset is unobserved. To address this, we investigate coreset construction using surrogate error metrics that are tractable and provably related to the true clustering cost. We analyze a traditional metric from prior work and introduce a new error metric that aligns more closely with the true cost. Although our metric is defined independently of the noise distribution, it enables approximation guarantees that scale with the noise level. We design a coreset construction algorithm based on this metric and show that, under mild assumptions on the data and noise, enforcing an $\varepsilon$-bound under our metric yields smaller coresets and tighter guarantees on the true clustering cost than those obtained via the traditional metric. In particular, we prove that the coreset size can improve by a factor of up to $\mathrm{poly}(k)$. Experiments on real-world datasets support our theoretical findings and demonstrate the practical advantages of our approach.
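For reference, the following is a minimal sketch of the standard background notions the abstract invokes; these definitions are textbook-standard for $(k,z)$-clustering coresets and are not taken from the paper itself. They show the $(k,z)$-clustering cost of a center set $C$ on a dataset $X$, and the classical $\varepsilon$-coreset guarantee that a surrogate error metric must stand in for when the true dataset $X$ is unobserved.
% Standard background (assumed, not stated in the abstract):
% the (k,z)-clustering cost of X under a center set C with at most k centers,
\[
  \mathrm{cost}_z(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^{z},
  \qquad |C| \le k,
\]
% and the classical eps-coreset requirement on a weighted subset S with weights w,
% where cost_z(S, C) = sum_{x in S} w(x) * min_{c in C} ||x - c||^z:
\[
  \bigl|\mathrm{cost}_z(S, C) - \mathrm{cost}_z(X, C)\bigr|
  \;\le\; \varepsilon \cdot \mathrm{cost}_z(X, C)
  \qquad \text{for every } C \text{ with } |C| \le k.
\]
Under the noisy-data setting of the abstract, $X$ itself is unavailable, which is why the guarantee above cannot be verified directly and a tractable surrogate metric is needed in its place.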