We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover all clusters exactly using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM margin, and standard notions of stability for center-based clusterings. Then, under our margin assumptions, we design algorithms that, in a variety of settings, recover all clusters exactly using only $O(\log n)$ queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that recovers arbitrary convex clusters in polynomial time, with a number of queries that is lower than that of the best existing algorithm by a $\Theta(m^m)$ factor. For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound and is provably near-optimal as a function of the packing number of the space. Finally, for clusterings realized by binary concept classes, we give a combinatorial characterization of recoverability with $O(\log n)$ queries, and we show that, for many concept classes in Euclidean spaces, this characterization is equivalent to our margin condition. Our results show a deep connection between cluster margins and active cluster recoverability.
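To make the query model concrete, the following is a minimal sketch of a same-cluster oracle together with a naive baseline that compares each point against one representative per discovered cluster. It is not one of the algorithms summarized above: it uses no margin assumption and spends up to $nk$ queries for $k$ clusters, which is the kind of cost the $O(\log n)$ bounds improve on. The names `make_oracle` and `naive_recover` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def make_oracle(labels: Sequence[int]) -> Tuple[Callable[[int, int], bool], Dict[str, int]]:
    """Build a same-cluster oracle over hidden labels, plus a query counter.

    Hypothetical helper: the labels stand in for the unknown ground truth.
    """
    counter = {"queries": 0}

    def same_cluster(i: int, j: int) -> bool:
        counter["queries"] += 1
        return labels[i] == labels[j]

    return same_cluster, counter


def naive_recover(n: int, same_cluster: Callable[[int, int], bool]) -> List[List[int]]:
    """Recover all clusters exactly by querying one representative per cluster.

    Baseline only: worst-case number of queries is n * k, not O(log n).
    """
    clusters: List[List[int]] = []
    for i in range(n):
        for cluster in clusters:
            # cluster[0] serves as the representative of this cluster.
            if same_cluster(cluster[0], i):
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so point i starts a new one.
            clusters.append([i])
    return clusters


hidden_labels = [0, 1, 0, 2, 1, 0]
oracle, counter = make_oracle(hidden_labels)
recovered = naive_recover(len(hidden_labels), oracle)
print(recovered, counter["queries"])
```

The baseline's query count grows linearly in $n$, which is why exploiting a margin between clusters is needed to reach logarithmic query complexity.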