Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coreset constructions for clustering problems such as $k$-means, both in theory and in practice. Curiously, there has been no work comparing the quality of the available $k$-means coresets. In this paper we perform such an evaluation. There is currently no known algorithm for measuring the distortion of a candidate coreset, and we provide some evidence as to why this may be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
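For reference, the notion of coreset quality discussed here is the standard $(1\pm\varepsilon)$ guarantee for $k$-means; the following is a minimal statement of that definition with illustrative notation (the symbols $P$, $\Omega$, $w$ are not taken from the paper itself):
$$
\operatorname{cost}_P(S) = \sum_{p \in P} \min_{s \in S} \lVert p - s\rVert^2,
\qquad
\operatorname{cost}_\Omega(S) = \sum_{q \in \Omega} w(q)\, \min_{s \in S} \lVert q - s\rVert^2 ,
$$
where $P \subset \mathbb{R}^d$ is the input and $\Omega$ is a weighted summary with weights $w$. Then $\Omega$ is an $\varepsilon$-coreset if, for every candidate solution $S \subset \mathbb{R}^d$ with $|S| = k$,
$$
\bigl|\operatorname{cost}_\Omega(S) - \operatorname{cost}_P(S)\bigr| \;\le\; \varepsilon \cdot \operatorname{cost}_P(S),
$$
and the distortion of a candidate coreset is the smallest such $\varepsilon$ over all solutions $S$, which is why certifying it requires reasoning over all possible sets of $k$ centers.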