Approximate nearest neighbor search (ANN) is a common way to retrieve relevant search results, especially now in the context of large language models and retrieval augmented generation. One of the most widely used algorithms for ANN is based on constructing a multi-layer graph over the dataset, called the Hierarchical Navigable Small World (HNSW). While this algorithm supports insertion of new data, it does not support deletion of existing data. Moreover, deletion algorithms described by prior work come at the cost of increased query latency, decreased recall, or prolonged deletion time. In this paper, we propose a new theoretical framework for graph-based ANN based on random walks. We then utilize this framework to analyze a randomized deletion approach that preserves hitting time statistics compared to the graph before deleting the point. We then turn this theoretical framework into a deterministic deletion algorithm, and show that it provides better tradeoff between query latency, recall, deletion time, and memory usage through an extensive collection of experiments.


翻译:近似最近邻搜索(ANN)是一种常用的检索相关搜索结果的方法,在当前大语言模型和检索增强生成的背景下尤为重要。最广泛使用的ANN算法之一是基于在数据集上构建多层图,称为分层可导航小世界(HNSW)。虽然该算法支持新数据的插入,但不支持现有数据的删除。此外,先前工作描述的删除算法往往以增加查询延迟、降低召回率或延长删除时间为代价。本文提出了一种基于随机游走的图近似最近邻搜索新理论框架。我们利用该框架分析了一种随机删除方法,该方法在删除数据点后能保持与原始图相近的命中时间统计特性。随后,我们将该理论框架转化为确定性删除算法,并通过大量实验证明其在查询延迟、召回率、删除时间和内存使用之间实现了更优的权衡。

0
下载
关闭预览

相关内容

互联网
Top
微信扫码咨询专知VIP会员