Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.
翻译:受监管和不受监管的机器学习算法都被用来学习基于分区的索引结构,用于近邻的搜索。现有的受监管算法将学习任务表述为寻找一个分区,使培训设置点的近邻与点本身属于同一分区元素,以便通过天真地看望或反跟踪搜索来检索最近的邻国候选人。我们在ANN搜索中将候选人设置为多标签分类问题,标签与查询点的近邻相对应,并将分区解释为解决这项任务的分区分类师。经验性结果显示,以这种解释为基础的自然分类器在与任何未经监督或监督的分区战略相结合时,可以严格地改进性能。我们还证明为ANN搜索的分区分类师的一致性提供了充分的条件,并通过核实按时间轴 $-kd 树的情况来说明结果。