The seminal paper by Mazumdar and Saha \cite{MS17a} initiated an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact error probability of the underlying fully-random oracle. In this work, we develop robust learning methods that tolerate general semi-random noise, obtaining qualitatively the same guarantees as the best possible methods in the fully-random model. More specifically, given a set of $n$ points with an unknown underlying partition, we are allowed to query any pair of points $u,v$ to check whether they are in the same cluster, but with probability $p$ the answer may be adversarially chosen. We show that, information-theoretically, $O\left(\frac{nk \log n}{(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large size. Our main result is a computationally efficient algorithm that can identify large clusters with $O\left(\frac{nk \log n}{(1-2p)^2}\right) + \text{poly}\left(\log n, k, \frac{1}{1-2p}\right)$ queries, matching the guarantees of the best known algorithms in the fully-random model. As a corollary of our approach, we develop the first parameter-free algorithm for the fully-random model, answering an open question of \cite{MS17a}.
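As an illustrative aside, not taken from the paper's own analysis: the $(1-2p)^{-2}$ dependence in the bounds above is exactly what a standard Hoeffding argument gives for amplifying a single noisy query by repetition and majority vote, sketched here under the fully-random assumption that repeated answers for the same pair are independent and each correct with probability $1-p > \tfrac{1}{2}$.
% Sketch (standard amplification argument, not the paper's algorithm):
% repeat one pairwise query $m$ times and output the majority answer.
\begin{align*}
\Pr[\text{majority is wrong}]
  &\le \Pr\Big[\textstyle\sum_{i=1}^{m} X_i \le \tfrac{m}{2}\Big]
     && X_i = \mathbf{1}\{\text{$i$-th answer is correct}\},\ \ \mathbb{E}\big[\textstyle\sum_i X_i\big] = (1-p)m \\
  &\le \exp\!\left(-\tfrac{2}{m}\Big(\tfrac{(1-2p)m}{2}\Big)^{2}\right)
     && \text{(Hoeffding, for the deviation } (1-p)m - \tfrac{m}{2} = \tfrac{(1-2p)m}{2}\text{)} \\
  &= \exp\!\left(-\tfrac{(1-2p)^{2} m}{2}\right),
\end{align*}
so $m = O\!\left(\frac{\log n}{(1-2p)^2}\right)$ repetitions drive the failure probability below $1/\mathrm{poly}(n)$, and repeating this for roughly $nk$ informative pairs matches the shape of the $O\!\left(\frac{nk\log n}{(1-2p)^2}\right)$ bound. The same tail bound applies to the $(1-p)$-fraction of answers a semi-random adversary cannot alter, which is consistent with the semi-random guarantees above having the same dependence on $p$.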