Grouping together similar elements in datasets is a common task in data mining and machine learning. In this paper, we study streaming and parallel algorithms for correlation clustering, where each pair of elements is labeled either similar or dissimilar. The task is to partition the elements and the objective is to minimize disagreements, that is, the number of dissimilar elements grouped together and similar items that get separated. Our main contribution is a semi-streaming algorithm that achieves a $(3 + \varepsilon)$-approximation to the minimum number of disagreements using a single pass over the stream. Our approach builds on the analysis of the PIVOT algorithm by Ailon, Charikar, and Newman [JACM'08] that obtains a $3$-approximation in the centralized setting. Our design allows us to sparsify the input graph by ignoring a large portion of the nodes and edges without a large extra cost as compared to the analysis of PIVOT. This sparsification makes our technique applicable on several models of massive graph processing, such as semi-streaming and Massively Parallel Computing (MPC), where sparse graphs can typically be handled much more efficiently. For the semi-streaming model, our approach yields a single-pass algorithm that works in the adaptive-order setting. This improves on the approximation ratio of the recent single-pass $5$-approximation algorithm and on the number of passes of the recent $O(1/\varepsilon)$-pass $(3 + \varepsilon)$-approximation algorithm [Behnezhad, Charikar, Ma, Tan FOCS'22, SODA'23]. For linear-memory MPC, we get an $O(1)$-round algorithm where the round complexity is independent of $\varepsilon$, which only appears in the memory demand.
翻译:一种单通半流算法用于$(3+\varepsilon)$-近似相关聚类
翻译后的摘要:
本文研究了流式处理和并行算法在相关聚类问题中的应用,其中每对元素被标记为相似或不相似,任务是将元素分组,并旨在最小化分歧数量,即分组在一起的不相似元素数量和分离在一起的相似元素数量。本文的主要贡献是一种半流式算法,通过一遍流式处理实现$(3+\varepsilon)$-近似的最小分歧数。我们的方法建立在Ailon、Charikar和Newman在JACM'08中对PIVOT算法的分析基础上,该算法在集中式环境中获得$3$-近似结果。我们的设计允许我们通过忽略大部分节点和边缘来稀疏化输入图,而不会对PIVOT的分析造成太大的额外代价。这种稀疏化使我们的技术适用于多种大规模图形处理模型,例如半流式处理和Massively Parallel Computing(MPC),在这些模型中,稀疏图通常可以更有效地处理。对于半流式模型,我们的方法得到了一个自适应顺序下的单遍算法。这提高了最近的单遍$5$-近似算法的近似比率和最近的$O(1/\varepsilon)$遍$(3+\varepsilon)$-近似算法[Behnezhad、Charikar、Ma、Tan FOCS'22,SODA'23]的通行次数。对于线性内存MPC,我们获得了一个$O(1)$轮的算法,其中轮数复杂度不依赖于$\varepsilon$,它仅出现在内存需求中。