Hierarchical clustering over graphs is a fundamental task in data mining and machine learning, with applications in domains such as phylogenetics, social network analysis, and information retrieval. Specifically, we consider the recently popularized objective function for hierarchical clustering due to Dasgupta. Previous algorithms for (approximately) minimizing this objective function require linear time and space. In many applications the underlying graph is massive, making it computationally challenging to process even with a linear time/space algorithm. As a result, there is strong interest in designing algorithms that can perform global computation using only sublinear resources. The focus of this work is to study hierarchical clustering for massive graphs under three well-studied models of sublinear computation, which focus on space, time, and communication, respectively, as the primary resource to optimize: (1) the (dynamic) streaming model, where edges are presented as a stream; (2) the query model, where the graph is accessed using neighbor and degree queries; and (3) the MPC model, where the graph edges are partitioned over several machines connected via a communication channel. We design sublinear algorithms for hierarchical clustering in all three models. At the heart of our algorithmic results is a view of the objective in terms of cuts in the graph, which allows us to use a relaxed notion of cut sparsifiers to perform hierarchical clustering while introducing only a small distortion in the objective function. Our main algorithmic contributions are then to show how cut sparsifiers of the desired form can be efficiently constructed in the query model and the MPC model. We complement our algorithmic results by establishing nearly matching lower bounds that rule out the possibility of designing better algorithms in each of these models.
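For readers unfamiliar with the objective, the following minimal Python sketch illustrates Dasgupta's cost and its cut-based view. It relies on the standard definition (not spelled out in the abstract): each edge (u, v) pays its weight times the number of leaves under the lowest common ancestor of u and v. Equivalently, each internal node pays its leaf count times the weight of the edges cut between its children. The nested-tuple tree encoding and the names `leaves` and `dasgupta_cost` are assumptions made for illustration; this is a brute-force evaluator, not one of the sublinear algorithms described above.

```python
def leaves(tree):
    """Return the set of vertices (leaves) under a node.

    Assumption: a tree is a nested tuple of children, and any non-tuple
    value is a leaf (a vertex id), so vertex ids must not be tuples.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def dasgupta_cost(tree, weights):
    """Dasgupta cost of `tree` for a graph given as {(u, v): weight}.

    Cut view: an internal node with n leaves below it contributes
    n * (total weight of edges whose endpoints lie in different children).
    """
    if not isinstance(tree, tuple):
        return 0.0  # a single vertex separates nothing
    # Map every vertex under this node to the index of the child holding it.
    side = {v: i for i, child in enumerate(tree) for v in leaves(child)}
    crossing = sum(
        w
        for (u, v), w in weights.items()
        if u in side and v in side and side[u] != side[v]
    )
    return len(side) * crossing + sum(dasgupta_cost(c, weights) for c in tree)


# Unit-weight path graph 0 - 1 - 2 - 3 (hypothetical example data).
path = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}

balanced = ((0, 1), (2, 3))      # splits the path in the middle first
caterpillar = (((0, 1), 2), 3)   # peels off one endpoint at a time
scrambled = ((0, 2), (1, 3))     # cuts every edge at the root

print(dasgupta_cost(balanced, path))     # 8.0
print(dasgupta_cost(caterpillar, path))  # 9.0
print(dasgupta_cost(scrambled, path))    # 12.0
```

Because the cost depends on the graph only through cut weights between the vertex sets at each split, approximately preserving those cuts approximately preserves the cost of every tree. This is the intuition behind using cut sparsifiers, although the relaxed notion used in the paper is not reproduced here.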