We study k-median clustering under the sequential no-substitution setting. In this setting, a data stream is sequentially observed, and some of the points are selected by the algorithm as cluster centers. However, a point can be selected as a center only immediately after it is observed, before observing the next point. In addition, a selected center cannot be substituted later. We give the first algorithm for this setting that obtains a constant approximation factor on the optimal risk under a random arrival order, an exponential improvement over previous work. This is also the first constant approximation guarantee that holds without any structural assumptions on the input data. Moreover, the number of selected centers is only quasi-linear in k. Our algorithm and analysis are based on a careful risk estimation that avoids outliers, a new concept of a linear bin division, and a multiscale approach to center selection.
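To make the setting concrete, the following is a minimal sketch of the no-substitution protocol only. It is not the algorithm of this work: the selection rule `far_from_centers` and its `threshold` parameter are hypothetical placeholders chosen for illustration, and Euclidean distance is assumed. The sketch shows that each point is either selected when it is observed or lost for good, and that the risk is the sum of distances from each point to its nearest selected center.

```python
import random
from typing import Callable, List, Sequence

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance (an assumption; any metric could be used)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def risk(points: List[Point], centers: List[Point]) -> float:
    """k-median risk: sum of distances from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def no_substitution_stream(
    stream: List[Point],
    rule: Callable[[Point, List[Point]], bool],
) -> List[Point]:
    """Run a selection rule under the no-substitution constraint.

    The rule sees only the current point and the centers chosen so far.
    A selected point is appended immediately and is never removed or replaced.
    """
    centers: List[Point] = []
    for p in stream:
        if rule(p, centers):
            centers.append(p)  # irrevocable decision, made before the next point
    return centers


def far_from_centers(threshold: float) -> Callable[[Point, List[Point]], bool]:
    """Hypothetical placeholder rule: select a point if no center is within threshold."""
    def rule(p: Point, centers: List[Point]) -> bool:
        return not centers or min(dist(p, c) for c in centers) > threshold
    return rule


if __name__ == "__main__":
    rng = random.Random(0)
    # Three well-separated synthetic clusters, presented in random arrival order.
    data = [
        (rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
        for cx, cy in [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
        for _ in range(50)
    ]
    rng.shuffle(data)
    centers = no_substitution_stream(data, far_from_centers(2.0))
    print(len(centers), risk(data, centers))
```

A fixed-threshold rule like this one carries no approximation guarantee and depends on knowing the scale of the data in advance; it serves only to show the interface that any algorithm for this setting has to respect.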