A coreset is a point set containing information about geometric properties of a larger point set. A series of previous works show that in many machine learning problems, especially in clustering problems, coreset could be very useful to build efficient algorithms. Two main measures of an coreset construction algorithm's performance are the running time of the algorithm and the size of the coreset output by the algorithm. In this paper we study the construction of coresets for the $(k,z)$-clustering problem, which is a generalization of $k$-means and $k$-median problem. By properly designing a sketching-based distance estimation data structure, we propose faster algorithms that construct coresets with matching size of the state-of-the-art results.
翻译:核心数据集是一个集点, 包含关于大点数组几何特性的信息。 先前的一系列工程显示, 在许多机器学习问题中, 特别是在集群问题中, 核心数据集对于建立高效的算法可能非常有用。 核心数据集构建算法的两种主要性能是算法运行时间和由算法决定的核心数据集输出大小。 在本文中, 我们研究$( k,z) $( z) 组合问题的核心数据集的构建, 即 $k$- under和 $k- medn 问题的一般化。 通过正确设计基于绘图的远距估计数据结构, 我们建议了更快的算法, 构建与最新结果大小相匹配的核心数据集 。