Offline k-means clustering has been studied extensively, and constant-factor approximation algorithms are available. Online clustering, however, remains largely uncharted. New factors come into play, such as the ordering of the dataset and whether the number of points, n, is known in advance, and their exact effects are not well understood. In this paper we focus on the online setting where decisions are irreversible: after a point arrives, the algorithm must decide whether or not to take the point as a center, and this decision is final. How many centers are necessary and sufficient to achieve a constant approximation in this setting? We show upper and lower bounds for all the different cases. These bounds match up to a constant factor and are therefore optimal. For example, for the k-means cost with constant k>1 and a random-order stream, Theta(log n) centers are necessary and sufficient to achieve a constant approximation, while a priori knowledge of n alone reduces the required number of centers to a constant. These bounds hold for any distance function that satisfies a triangle-type inequality.
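To make the irreversible online setting concrete, the sketch below (a minimal illustration, not an algorithm from the paper) streams points one at a time and applies a simple probabilistic rule in the spirit of online facility location: each arriving point is either taken as a center immediately or discarded for good. The function name `online_centers` and the `threshold` parameter are hypothetical choices made for the example.

```python
import random

def online_centers(stream, threshold):
    """Illustrative online center selection with irreversible decisions.

    `stream` yields points (tuples of floats); `threshold` is a hypothetical
    distance parameter. Each arriving point is either taken as a center
    immediately or discarded forever -- the decision is never revisited.
    """
    centers = []
    for point in stream:
        if not centers:
            centers.append(point)  # the first point always becomes a center
            continue
        # squared Euclidean distance to the nearest existing center (k-means cost)
        d2 = min(sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers)
        # take the point as a center with probability proportional to its cost,
        # capped at 1 -- a simple rule, not the paper's method
        if random.random() < min(1.0, d2 / threshold):
            centers.append(point)
    return centers

# Usage: a random-order stream of 2-D points with a hypothetical threshold
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
print(len(online_centers(points, threshold=5.0)))
```

Under assumptions like these, the number of centers taken grows with n; the paper's bounds characterize how many centers any such irreversible rule must and can take under the various orderings and knowledge of n.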