Motivated by applications in crowdsourced entity resolution in databases, signed edge prediction in social networks, and correlation clustering, Mazumdar and Saha [NIPS 2017] proposed an elegant theoretical model for studying clustering with a faulty oracle. In this model, given a set of $n$ items that belong to $k$ unknown groups (or clusters), the goal is to recover the clusters by asking pairwise queries to an oracle. The oracle answers queries of the form ``do items $u$ and $v$ belong to the same cluster?''. However, the answer to each pairwise query is wrong with probability $\varepsilon$, for some $\varepsilon\in(0,\frac12)$. Mazumdar and Saha provided two algorithms under this model: one is query-optimal but time-inefficient (i.e., it runs in quasi-polynomial time), while the other is time-efficient (i.e., it runs in polynomial time) but query-suboptimal. Larsen, Mitzenmacher and Tsourakakis [WWW 2020] then gave a new time-efficient algorithm for the special case of $2$ clusters, which is query-optimal if the bias $\delta:=1-2\varepsilon$ of the model is large. It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters and other regimes of $\delta$. In this paper, we make progress on this question and provide a time-efficient algorithm with nearly optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime where information-theoretic recovery is possible. Our algorithm is built on a connection to the stochastic block model.
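To make the query model concrete, the following is a minimal Python sketch of a faulty oracle, not the algorithm of this paper. The class name `FaultyOracle`, the helper `empirical_error_rate`, and the `seed` parameter are hypothetical names introduced for illustration. The sketch also assumes that each pair's answer is drawn once and cached, so repeating a query returns the same answer; the abstract does not state this, and it is flagged as an assumption in the code.

```python
import random


class FaultyOracle:
    """Illustrative simulation of a pairwise same-cluster oracle with error rate eps."""

    def __init__(self, labels, eps, seed=0):
        if not 0.0 < eps < 0.5:
            raise ValueError("eps must lie in the open interval (0, 1/2)")
        self.labels = list(labels)      # hidden ground-truth cluster of each item
        self.eps = eps
        self.rng = random.Random(seed)  # seeded for reproducibility
        self.cache = {}                 # ASSUMPTION: answers are fixed per pair
        self.num_queries = 0            # number of distinct pairs queried

    @property
    def bias(self):
        # delta := 1 - 2 * eps, as defined in the abstract
        return 1.0 - 2.0 * self.eps

    def query(self, u, v):
        """Answer 'do items u and v belong to the same cluster?', wrong w.p. eps."""
        key = (min(u, v), max(u, v))
        if key not in self.cache:
            self.num_queries += 1
            truth = self.labels[u] == self.labels[v]
            flipped = self.rng.random() < self.eps
            self.cache[key] = truth != flipped  # XOR: flip the true answer on error
        return self.cache[key]


def empirical_error_rate(oracle):
    """Fraction of all pairs whose oracle answer disagrees with the ground truth."""
    n = len(oracle.labels)
    wrong = total = 0
    for u in range(n):
        for v in range(u + 1, n):
            truth = oracle.labels[u] == oracle.labels[v]
            wrong += oracle.query(u, v) != truth
            total += 1
    return wrong / total
```

Querying all $\binom{n}{2}$ pairs, as `empirical_error_rate` does, is only a sanity check on the noise level; the point of the model is to recover the clusters with far fewer queries.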