The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by the modern large-scale applications, we study the problem in the \streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta. Assuming that the input is an $n$-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs: * With $O(n\cdot \text{polylog}\,{n})$ space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm. * When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of optimum HC tree to within an $o(\frac{\log{n}}{\log\log{n}})$ factor, even when allowed $\text{polylog}{\,{n}}$ passes over the input. * In the most stringent setting of $\text{polylog}\,{n}$ space, we rule out algorithms that can even distinguish between "highly"-vs-"poorly" clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective value. * Finally, we prove that any single-pass streaming algorithm that computes an optimal HC tree requires to store almost the entire input even if allowed exponential time. Our algorithmic results establish a general structural result that proves that cut sparsifiers of input graph can preserve cost of "balanced" HC trees to within a constant factor. Our lower bound results include a new streaming lower bound for a novel problem "One-vs-Many-Expanders", which can be of independent interest.
翻译:高级集成 (HC) 问题包括建立一组组群以代表给定的数据集 。 在现代大型应用程序的驱动下, 我们研究流式模型中的问题, 该模型的内存非常有限, 只允许一次或很少通过输入 。 具体地说, 我们调查是否可以获得良好的等级组群, 或者至少我们可以估计最佳等级组的值 。 为了测量一个等级的质量, 我们使用 Dassalcial 推出的 最小化 HC 目标 。 假设输入是美元( 美元) 的垂直加权图, 其边缘在流中到达一个流中, 我们用以下结果来计算: $( n) (c) 数字组的内值 。 我们开发了一个单流式算法, 它的近似比率可以与目前最好的离线式算法结果相匹配 。 * 当空间更有限时, 即, 美元/ 流/ 美元( i) 数字组的内, 我们证明甚至无法计算最佳的 HC 直径值值值值值值值值值值, 美元( 美元) 最终确定一个 美元 。