Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional computational overhead when having to examine unnecessary points. This in turn reduces the performance of query processing. In contrast, we propose a fast and accurate in-memory LSH framework, called PM-LSH, that aims to compute c-ANN queries on large-scale, high-dimensional datasets. First, we adopt a simple yet effective PM-tree to index the data points. Second, we develop a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality. Third, we propose an efficient algorithm on top of the PM-tree to improve the performance of computing c-ANN queries. In addition, we extend PM-LSH to support closest pair (CP) search in high-dimensional spaces. We again adopt the PM-tree to organize the points in a lowdimensional space, and we propose a branch and bound algorithm together with a radius pruning technique to improve the performance of computing c-approximate closest pair (c-ACP) queries. Extensive experiments with real-world data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and accuracy for both NN and CP search.
翻译:近邻( NN) 搜索在高维空间本身就具有计算成本。 由于维度的诅咒, 高维空间的搜索本身就具有计算成本。 作为一种众所周知的解决方案, 地点敏感散列( LSH) 能够以恒定的概率在亚线性时间内解答 C- 近NN( c- ANN) 质询。 现有的 LSH 方法主要侧重于构建基于散桶的索引, 这样可以快速检索候选点。 但是, 现有的粗粗粗的图像结构无法为候选人点提供准确的距离估计, 从而在检查不必要的点时会变成额外的计算间接费用。 这反过来降低了查询处理的性能。 相反, 我们提议快速准确的 LSH 框架( 称为 PM- LSH 框架), 旨在对大规模、 高维度的数据集进行计算。 首先, 我们采用简单而有效的 PM- 树来为数据点索引提供精确的准确的距离估计, 保证高质量的结果质量。 第三, 我们提议在PMCP 上的一个顶级和最精确的轨道上进行高效的算算, 我们提出一个最精确的轨道搜索。