We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space together with a label for each data point, provided by an oracle, indicating which points should be clustered together. This setting captures situations where we have access to auxiliary information about the data set that is relevant to our clustering objective, for instance labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an $\alpha$ fraction of false positives and false negatives, for some $\alpha \in (0,c)$ with $c<1$; in the absence of these errors the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a dataset of size $m$, we propose a deterministic $k$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized algorithm, while preserving the $O(dm \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate: our bound holds for $\alpha$ up to $1/2$, improving on the requirement $\alpha \le 1/7$ in previous work. For the $k$-medians problem, we improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$, obtaining a cost of $(1+O(\alpha))\mathrm{OPT}$ while requiring essentially only $O(md \log^3 m/\alpha)$ runtime.
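To make the setting concrete, here is a minimal, hypothetical sketch (not the paper's algorithm): given oracle-predicted labels, one can form a candidate center for each predicted cluster as its coordinate-wise mean, and measure the resulting $k$-means cost. All function names and the toy data are illustrative assumptions; the point is only that with an $\alpha$ fraction of mislabeled points, the prediction-derived centers incur a cost somewhat above $\mathrm{OPT}$.

```python
# Toy illustration of learning-augmented k-means with noisy oracle labels.
# This is an illustrative sketch, NOT the algorithm from the paper.

def centers_from_predictions(points, labels, k):
    """One center per predicted cluster: the coordinate-wise mean of its points."""
    d = len(points[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p, lab in zip(points, labels):
        counts[lab] += 1
        for j, x in enumerate(p):
            sums[lab][j] += x
    return [[s / max(c, 1) for s in sums[i]] for i, c in enumerate(counts)]

def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(
        min(sum((x - c[j]) ** 2 for j, x in enumerate(p)) for c in centers)
        for p in points
    )

# Two well-separated planted clusters of 3 points each.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
true_labels = [0, 0, 0, 1, 1, 1]
noisy_labels = [1, 0, 0, 1, 1, 1]  # one mislabeled point (a false positive for cluster 1)

clean_cost = kmeans_cost(points, centers_from_predictions(points, true_labels, 2))
noisy_cost = kmeans_cost(points, centers_from_predictions(points, noisy_labels, 2))
```

With exact labels the centers are the true cluster means, so the cost equals the planted clustering's cost; with one flipped label, the contaminated center drifts and the cost grows, which is the degradation the $(1+O(\alpha))\mathrm{OPT}$-type guarantees control.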