A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role in many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs that are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: first, constructing dense graphs is infeasible in practice for large datasets, and second, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present $\textit{Stars}$: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners using significantly fewer similarity comparisons, which are a major bottleneck for learning-based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds, in nearly linear time, a graph in which approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars on multiple datasets, enabling graph building at the $\textit{Tera-Scale}$, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold improvements in the number of pairwise similarity comparisons relative to different baselines, and 2- to 10-fold improvements in running time without quality loss.
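
To make the two-hop spanner definition concrete, the following is a minimal toy sketch of the property itself; it is not the Stars algorithm, whose construction is not described here. The function names, the five-point example, and the choice of point 0 as the center are all assumptions made for illustration. The sketch checks whether every similar pair is joined by a path of length at most two, and compares a dense graph with two sparse ones.

```python
from itertools import combinations


def build_adjacency(n, edges):
    """Return an undirected adjacency map for vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def within_two_hops(adj, u, v):
    """True if u and v share an edge or share at least one neighbor."""
    return v in adj[u] or bool(adj[u] & adj[v])


def is_two_hop_spanner(n, edges, similar_pairs):
    """Check the two-hop property for every pair labeled as similar."""
    adj = build_adjacency(n, edges)
    return all(within_two_hops(adj, u, v) for u, v in similar_pairs)


# Hypothetical toy data: five mutually similar points, labeled 0..4.
n = 5
similar_pairs = list(combinations(range(n), 2))  # all 10 pairs are similar

clique_edges = similar_pairs                      # dense: one edge per similar pair
star_edges = [(0, v) for v in range(1, n)]        # sparse: every point links to point 0
path_edges = [(v, v + 1) for v in range(n - 1)]   # equally sparse, but a chain

for name, edges in [("clique", clique_edges), ("star", star_edges), ("path", path_edges)]:
    print(name, len(edges), is_two_hop_spanner(n, edges, similar_pairs))
```

In this toy case the star uses 4 edges instead of 10 and still keeps every similar pair within two hops, whereas the chain uses the same 4 edges and does not, because points 0 and 3 are three hops apart. Sparsity alone is therefore not sufficient: the shape of the graph determines whether the two-hop property holds.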